Tuesday, 26 June 2007

Using Tuning Manager for something useful - Part I

Stephen commented about Tuning Manager and doing something useful with the data. I thought I would use this as an opportunity to highlight some of the things I look at on a regular basis (almost daily, including today in fact). Part I - Write Pending.

First, a little bit of background: Write Pending is a measure of write I/Os which are in cache and haven't yet been destaged to disk. A USP/9980V (like most other arrays) stores write I/Os in cache and acknowledges successful completion to the host immediately. The writes are then destaged to disk asynchronously some time later.

Caching writes (which, I could be wrong, may have been invented by IBM and called Cache Fast Write or DASD Fast Write; I can't remember which was which) provides the host with a consistent response and helps to remove the mechanical delay of writing to disk. Most I/O will be a mixture of reads and writes, and therefore an array can even out the write I/O to provide a more consistent I/O response. I see consistent I/O response as one of the major benefits or features of enterprise arrays.

Tuning Manager collects Write Pending (expressed as a percentage of cache being used for write I/O) in two ways; you can use Performance Reporter and collect real-time data from the agent/command device talking to the array. This can be plotted on a nice graph down to 10 second intervals. You can also collect historical WP data, which is retrieved from the agent on an hourly basis, providing an average and maximum figure. Hourly stats are collected for a maximum of 99 periods (just over 4 days) and can be aggregated into daily, weekly and monthly figures.

A rising and falling WP figure is not a problem and is to be expected in the normal operation of an array. However, WP can reach critical levels at which a USP will perform more aggressive destaging to disk; these levels are 30% and 40%. Once WP reaches 70%, inflow control takes over and starts to elongate response times to host write requests to limit the amount of data coming into the array. This obviously has a major impact on performance. I would look to put monitoring in place within Tuning Manager to alert when WP reaches these figures. Why would an array see high WP levels? Clearly, all the issues are write I/O related, but they could include high levels of backup, database loads, file transfers and so on. It could also be an issue with data layout.
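As a rough sketch of the kind of alerting I have in mind, the thresholds above can be turned into a simple classifier. The function name and the level labels are my own invention for illustration and are not part of Tuning Manager; only the 30%, 40% and 70% figures come from the text.

```python
def classify_write_pending(wp_percent: float) -> str:
    """Map a Write Pending percentage to an illustrative alert level.

    The labels are hypothetical; the thresholds (30, 40, 70) are the
    USP figures quoted in the post.
    """
    if not 0 <= wp_percent <= 100:
        raise ValueError("Write Pending must be a percentage between 0 and 100")
    if wp_percent >= 70:
        return "CRITICAL: inflow control"
    if wp_percent >= 40:
        return "WARNING: aggressive destaging (second level)"
    if wp_percent >= 30:
        return "NOTICE: aggressive destaging (first level)"
    return "OK"


# Classify a few sample readings.
for sample in (12.0, 30.0, 45.5, 70.0):
    print(f"{sample:5.1f}% -> {classify_write_pending(sample)}")
```

In practice the readings would come from whatever export or alert hook Tuning Manager provides; that part is deliberately left out here.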

If an array can't cope with WP reaching 30%+ then some tuning may be required. Consider distributing workload over a wider time period (all backups may have been set up to start at the same time, for instance) and investigate how the data receiving the write activity is laid out; distributing the writes over more LUNs and array groups could be a remedy. You may also have to choose faster array group disks for certain workloads and/or purchase additional cache.

I think Tuning Manager could do with a few improvements in terms of how Write Pending data is collected. Firstly, unless I'm running Performance Reporter, I have no idea within an hourly period when the max WP was reached and for how long it stayed at this level. Some kind of indicator would be useful - an hour is a long time to trawl through host data.

I'd also like to know, when the Max WP occurs, which LUNs had the most write I/O activity. LDEV activity is also summarised on the hour, so it's not easy to see which LUNs within an hourly interval were those causing the WP issue. As an example, an LDEV could have been extremely active for 10 minutes and caused the problem but, on average, could have done much less I/O than another LDEV which was consistently busy (with a low I/O rate) for the whole 1 hour period. The averaging over 1 hour has a tendency to mask out peaks.
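To show how hourly averaging hides a burst, here is a small sketch. The per-minute write rates are made-up numbers chosen purely for illustration; they are not taken from any real array.

```python
from statistics import mean

# Hypothetical per-minute write IOPS for two LDEVs over one hour.
bursty_ldev = [3000] * 10 + [0] * 50   # extremely active for 10 minutes, then idle
steady_ldev = [800] * 60               # consistently busy for the whole hour


def hourly_summary(samples):
    """Return (average, maximum) for a list of per-minute samples."""
    return mean(samples), max(samples)


bursty_avg, bursty_max = hourly_summary(bursty_ldev)
steady_avg, steady_max = hourly_summary(steady_ldev)

print(f"bursty LDEV: avg={bursty_avg:.0f} max={bursty_max}")
print(f"steady LDEV: avg={steady_avg:.0f} max={steady_max}")
```

On the hourly average the bursty LDEV looks like the quieter of the two, even though its 10 minute peak is far higher than anything the steady LDEV produced.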

Finally, I'd like to be able to have triggers which can start more granular collection. So when a critical WP issue occurs, I want to collect 1 minute I/O stats for the next hour to an external file. I think it is possible to trigger a script from an alert, but I'm not clear on whether that trigger occurs at the point the max WP value is reached or whether it is triggered when the interval collection occurs on the hour (by which time I could be 59 minutes too late).

Next time, I'll cover Side-file usage for those who run TrueCopy async.

Monday, 25 June 2007

Responding to Comments

I'm not quite sure what the right way is to respond to posts; if I comment after them, then there's a chance that the responses might be missed. Anyway, I will attempt to go back and check all those comments I've not responded to. First, here's a response to Cedric's question on VMware:

What I was referring to was (a) the way in which VMware lays storage out over LUNs and (b) the way storage systems do failover at a LUN level.

For standard arrays which present LUNs, the lowest unit of granularity for TrueCopy/SRDF etc is the LUN. That's fine for systems that have lots of LUNs and then recombine them at the host level to create volumes. VMware works on a smaller number of larger LUNs (in my experience 50-200GB metas or LUSEs) and then divides this storage across multiple guests. So failover of a LUN could mean failover of a number of hosts. The tradeoff is therefore on how to lay out the storage in order to allow failover to work without affecting multiple hosts and also when additional storage is needed for a host how to present that additional storage.

The issue occurs because SAN storage is presented through the VMware hypervisor. iSCSI LUNs are presented directly to the host and so the original granularity can be retained and iSCSI LUNs can still be replicated. Therefore DR on iSCSI presented LUNs can easily be achieved.

Under the current versions of VMware (and please correct me if I'm wrong) it is possible to boot a guest from an iSCSI LUN and therefore theoretically possible to fail this over to another array. Personally, I'd not do that as I think it would present significant complications to achieve and I'd prefer to just present data LUNs for replication and keep O/S data local.

I hope this clarifies my thinking, feel free to post if you want me to clarify more.

Thursday, 21 June 2007

Hardware Replacement Lifecycle

How do you replace your hardware? I've always thought a sound technology replacement lifecycle is essential. Brand new arrays are great. You can't wait to get them in and start using the hardware and before you know it, they are starting to fill up. If you didn't order a fully populated array (and if we are buying DMX-3, then that's almost a certainty) then you might extend it a few times with more disks, cache and ports.

But then the free maintenance period starts to loom. Nobody wants to think about the hassle of getting out that old array. Maintenance kicks in and before you know it, you're paying charges on hardware you want to replace.

In some respects it's a bit like using a credit card. You tell yourself you absolutely won't use it for credit. You'll pay the balance off each month, but then something comes along that you just must have (like that shiny new plasma screen) and you think; "I can pay that off over a couple of months, no sweat". But then you're hooked and paying interest which you never wanted to pay.

Although it's hard, it is essential to establish a hardware replacement strategy. There are lots of good reasons for shipping old hardware out:

  • you will have to pay maintenance
  • there may be compatibility issues
  • parts may become difficult to locate (e.g. small disk sizes)
  • the vendor may end support
  • power/cooling/density pressures may increase
  • no new features will be developed for non-strategic hardware

So here's my plan. Agree a replacement schedule and stick to it. Firstly you keep your risk exposure to a minimum, but also you'll have the vendors knocking at your door as they will know you are in the market for new equipment. Users of the array can also be notified of the lifecycle and be made aware ahead of time of the refresh cycle. Imagine a 4-year hardware lifetime. An example timeline could be;

  1. Year 0 to the end of Year 3 - GREEN: the array is a target for new allocations. The array is a target for maintenance upgrades on a regular basis (say 3-6 months, depending on release cycles)
  2. Year 4, first half - AMBER: the array is still a target for adding storage to existing hosts on the array. No new storage/ports/cache will be added to the array. No new hosts will be added to the array. Maintenance on the standard cycle still continues.
  3. Year 4, second half - RED: the array is a target for active migrations. Servers requiring new storage are forced to move off the array in order to receive their storage. Servers are scheduled to be moved to a new array within the 6 months. No new maintenance other than emergency fixes is to be applied.
  4. Year 4 end - BLUE: the array is empty. Maintenance is terminated. The array is decommissioned and shipped out.
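
The timeline above could be expressed as a simple lookup. This is only a sketch under my own reading of the schedule: the first three years are GREEN, the fourth year is split into two six-month halves (AMBER then RED), and anything at or beyond 48 months is BLUE. The function name and the month boundaries are assumptions for illustration.

```python
def lifecycle_phase(age_months: int) -> str:
    """Return the colour phase for an array of the given age in months.

    Assumes a 4-year lifetime: months 0-35 GREEN, 36-41 AMBER,
    42-47 RED, and 48 onwards BLUE (decommissioned).
    """
    if age_months < 0:
        raise ValueError("age_months cannot be negative")
    if age_months < 36:
        return "GREEN"
    if age_months < 42:
        return "AMBER"
    if age_months < 48:
        return "RED"
    return "BLUE"


# Show the phase at a few points in the array's life.
for months in (0, 35, 36, 42, 48):
    print(f"{months:2d} months -> {lifecycle_phase(months)}")
```

If the term is shortened or lengthened, only the three boundaries need to change.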

There's an obvious flaw in this policy - new hardware has to be on the floor during the last 6-12 months of the term to make it work. But, in any event, that has to happen anyway; either you get a new array in early, or you pay maintenance on the old hardware as you deploy a new array and migrate to it.

It may be that vendor hardware releases dictate that the 4 year term is shortened or increased to ensure the latest technology can be put on the floor. That could be an issue depending on the terms of maintenance for existing equipment, however getting that sorted is just a case of getting the right commercial arrangement.

What do you do?

Tuesday, 19 June 2007

Device Manager Update

More constructive Device Manager issues are posted here: Device Manager Issues.

HiCommand Export

Has anyone tried to use the HiCommand database transfer command (hcmdsdbtrans)? You can use it to export the contents of a database for later re-importation.

I've been looking at it as a way to build a Device Manager test environment, or to move a configuration between sites where there's no IP connectivity. The command produces a configuration file which, as luck would have it, is a zip file. So just run the command and change the extension to ".zip" and you can look at the contents.

There are lots of files in CSV format within the export. These contain all of the expected details, such as the list of hosts, arrays, replication details and so on. There are a few interesting things of note, which I can't find anywhere else in Device Manager, either in the web interface or the CLI. For example, there's a file called licencekey.csv which lists all of the licences installed on the arrays in the database. Another file, host.csv (not surprisingly) lists all of the hosts defined in Device Manager. This seems to contain data I've not been able to locate in other parts of the product, specifically the O/S version for those hosts which have reported back via an agent. It's a shame this information is being collected and not being made available.
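
If you would rather script the inspection than rename and unzip by hand, something like the following sketch should work, since Python's zipfile module does not care about the file extension. The function name is mine, the UTF-8 encoding is an assumption, and I have not checked it against every version of the export.

```python
import csv
import io
import zipfile


def list_export_tables(export_path):
    """Return {csv_name: header_row} for every CSV file inside the export archive."""
    tables = {}
    with zipfile.ZipFile(export_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Encoding is an assumption; adjust if the export uses another one.
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
                # First row is taken as the header; empty files give an empty list.
                tables[name] = next(csv.reader(text), [])
    return tables
```

Calling list_export_tables() with the path to the exported file gives a quick map of which CSV files are present and which columns each one carries, which is a handy starting point before digging into the SQL.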

One file contains the SQL commands to rebuild the database. This is useful as I can see what data is being stored and how it interrelates.

I believe there is an official API for Device Manager but I've never been able to get a hold of the documentation. Does anyone know if it is a published document?

Monday, 18 June 2007

New Poll - VTL

As VTL has featured recently, I thought I'd make that the subject of the next poll. It should be up on the right-hand side bar now.

USP Poll

I've closed the USP poll and the results are as follows;

  • It rocks! What a great DMX basher! 30%
  • It's OK, some of the new features seem useful. 15%
  • It looks just like the old USP but slightly faster 40%
  • I don't see what all the fuss is about... 10%
  • What's a USP-V? 5%

I think this sums up the comments that were flying at the time.

Wednesday, 13 June 2007

ECC 6.0 vs HiCommand 6.0?

EMC recently announced the release of ECC (ControlCenter) version 6.0. This version finally meets customer demands for VMware support (which, despite the fact EMC own VMware, has been a long time coming) plus a whole host of enhanced reporting.

I did a lot of work recently on virtualisation of storage tools using VMware and this version of ECC was planned not only to be the one which supported virtual guests (i.e. could understand how the VMware ESX Server had taken real storage and presented the guest O/S some of that) but also would be the version to run in a virtualised environment. Previous versions of ECC would work virtualised, however EMC were understandably concerned with performance and any negative impact bad performance of ECC on VMware would create. New versions of ECC also support new features and functionality such as other vendors' arrays (there were issues with the 5.x versions supporting NSC55 correctly). I think the enhanced reporting will prove extremely useful and overall ECC is now a great product.

Compare this to HDS' suite of products, HiCommand, currently at version 5.5 (check here for the compatibility matrix for Device Manager dated November 2006, not including the USP-V, check here for the same for Tuning Manager, quoting version 5.1 and other versions in the text, woefully out of date). A new version of this software will be needed in order to support USP-V and the new functionality. Will it be called 6.0?

Device Manager is still a second rate product at the current version. There are so many fundamental problems with the product. Here are just a few:

  • Poor agent handling. Multiple products still require multiple agents. No O/S information retrieved from host, no indication whether the agent is running. Single fixed host information "push" from the client, no central scheduling for intra-day host collection. No agent push deployment from the server.
  • No integrated (e.g. Active Directory) security.
  • Only 2 storage reports, one of which can't understand dual pathed devices.
  • Poor integration with other software, including CCI.
  • Quirky, inflexible allocation method for assigning storage and for assigning hosts into groups.
  • Lack of consistent feel for the products - the GUI has been made more consistent across the product range but agent installations are to differing locations (and change depending on things such as the version of HDLM installed)
  • Poor documentation
  • Unclear feature definitions (e.g. deletion of a host removes the "host security" - does that actually delete the storage, the HSD, or what?)

This is only a brief list as putting it all here in the blog would be more dull to read than my postings normally are. Also bear in mind Device Manager doesn't have to do that much - it doesn't even support other vendors' hardware. However I made this list to show how far ECC is ahead of HiCommand, bearing in mind they are at similar development levels.

I still think HDS don't get the whole software proposition. I know we all laugh at EMC calling themselves a software company, however at least they are heading in the right direction and excluding the integration issues from the 3 million other software companies they've purchased over the last few years they are turning out good SRM products (caveat, even ECC has issues).

As the storage offerings from HDS get more complicated, we need the tools to be trustworthy, usable and intuitive. Unless they start getting there, no-one will want to use thin provisioning and virtualisation because the tools will just not deliver the productivity savings that those features can achieve.

HDS, pleeeeease tell me version 6.0 will be better? At the same time, you might want to update your website to ensure the compatibility matrix references the latest products.

Tuesday, 12 June 2007

CCI Update

A while back, I commented on how CCI was a pain in the proverbial **** because specifying the replicated LUNs required them to be presented onto a port/HSD. Snig kindly pointed out that although the LDEV needs to be presented, you can now just use the LDEV number rather than the complicated TID/LUN representation previously required.

I've been checking this out today as I've been doing some replication work and want to put together a central replication server which can build disk groups for all possible hosts on an array. Any hoo, I found that you can now specify the LUN number of the device on a port in the HORCM file, if you specify the port and the HSD number - like CL1-C-6 or similar, where 6 is the HSD number. Unfortunately you still need to specify the TID which has to be obtained from RAIDSCAN.

I gleaned this information from "Command Control Interface User and Reference Guide" version MK90RD011-18 (June 2006). In here I still can't see a reference to using LDEVs rather than LUN numbers, which to me, still makes the CCI specification a pile of unmentionables. In fact, I republish here this remarkable paragraph from section 4.20.1 of the said manual:

"The way what CCI has addition of argument for the host group to the raidscan command and
the configuration file will not be able to maintain the compatibility with conventional CLI.
Therefore, CCI adopts a way that supports in the form which specifies a host group in the
port strings as follows."

Now this has me confused - in fact those clever Japanese have already pre-empted me because the previous paragraph tells me how I was about to be confused:

"The USP/NSC and 9900V have the defined host group in the port and are able to allocate
Host LU every this host group. CCI does not use this host LU, and specifies by using absolute
LUN in the port. Therefore, a user can become confused because LUN of the CCI notation
does not correspond to LUN on the host view and Remote Console. Thus, CCI supports a way
of specifying a host group and LUN on the host view."

It would be useful if someone from HDS could confirm if I've read the manual correctly (or if there's a later release) as I still don't think replication based on LDEV is possible. The sooner it is brought in the better; the current method is just way too risky for my liking.

Wednesday, 6 June 2007

RTFM

Quite a while back, I posted on the Green Datacentre. Actually, the post was over a year ago and datacentre power and cooling issues have become a really hot (sic) issue.

However, I have to castigate myself severely for not Reading the Flaming Manual, in particular on the power demands of the Cisco 9513 switch. Really the problem comes from working in an environment where power/cooling wasn't an issue - just whack it in and get the electricians to provision whatever power you need. No need to actually validate the power requirements - just overprovision and everything will be fine. But recently, I wanted to power a 9513 up outside the datacentre to do pre-installation checks. The question was, could I power it up from a normal 13A socket? With 6000W of power required, surely not! Well, this is where RTFM comes in.

Thumbing through the 95xx Installation Guide I found that the chassis and fan trays take 318 watts, supervisor cards take 126 watts each and the biggest module, the 48 port line-card, takes 195 watts each. Therefore a fully populated 9513 (two supervisors and eleven line-cards) would require 2715 watts. Based on UK power supplies of 240V, that's just over 11 amps. So, technically (although I wouldn't advise it and probably wouldn't try it) a 13A power supply would suffice.
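
For anyone who wants to check the sums, here they are as a short sketch. The wattages are the ones quoted above; the split of the 13 slots into two supervisors and eleven 48 port line-cards is my assumption of what "fully populated" means.

```python
# Per-component power figures quoted from the installation guide (watts).
CHASSIS_AND_FANS_W = 318
SUPERVISOR_W = 126
LINE_CARD_48_PORT_W = 195

# Assumed fully populated 9513: 13 slots = 2 supervisors + 11 line-cards.
SUPERVISORS = 2
LINE_CARDS = 11

SUPPLY_VOLTAGE = 240   # UK mains voltage used in the post
SOCKET_RATING_A = 13   # standard UK socket rating

total_watts = (
    CHASSIS_AND_FANS_W
    + SUPERVISORS * SUPERVISOR_W
    + LINE_CARDS * LINE_CARD_48_PORT_W
)
amps = total_watts / SUPPLY_VOLTAGE

print(f"Total draw: {total_watts} W")
print(f"Current at {SUPPLY_VOLTAGE} V: {amps:.2f} A")
print(f"Within a {SOCKET_RATING_A} A socket rating: {amps <= SOCKET_RATING_A}")
```

This gives 2715 W and roughly 11.3 A, matching the figures above.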

So what's with the 6000W power supply? In fact, the power supplies in the 9513 are 6000W *capable* but don't have to demand or supply that level of power. Each takes two power feeds (from separate PDUs hopefully) and if only one feed is available, the power supply can only provide 2900W. If you think about it, what that means is that overall, 2 power supplies can provide ample power to run the switch without being under stress (and therefore provide more consistent, smooth power). Should a PDU fail, then each PSU can supply 2900W, again easily enough to run the whole switch. If *at the same time* one PSU should also fail, the remaining power supply can still run the whole switch. It may not be a desirable situation but it is possible. In terms of what power/cooling requirement should be catered for, it is actually a maximum of 2715 watts.
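
The failure scenarios can be tabulated in the same way. This is a simplified sketch: it treats available power as the sum of what each surviving PSU can deliver (6000W with both feeds, 2900W with one) and ignores how the load is actually shared between supplies.

```python
SWITCH_LOAD_W = 2715      # maximum draw of a fully populated switch
PSU_TWO_FEEDS_W = 6000    # PSU capacity with both power feeds live
PSU_ONE_FEED_W = 2900     # PSU capacity with only one feed live

# Available power in each scenario described in the post.
scenarios = {
    "both PSUs, both PDUs": 2 * PSU_TWO_FEEDS_W,
    "both PSUs, one PDU lost": 2 * PSU_ONE_FEED_W,
    "one PSU, one PDU lost": PSU_ONE_FEED_W,
}

for name, available in scenarios.items():
    headroom = available - SWITCH_LOAD_W
    status = "OK" if headroom >= 0 else "INSUFFICIENT"
    print(f"{name:24s} available={available:5d} W headroom={headroom:5d} W {status}")
```

Even the worst case leaves a small margin, which is why the single remaining supply can still run the whole switch.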

It just shows what reading the manual can do for you.

Certification Nightmare

One of the things I've always found a real issue in heterogeneous environments is that of certification. By certification I mean confirming that an installation is vendor supported. This is no mean feat in today's storage world. There are multiple layers - storage array, fabric, HBA, O/S, logical volume manager and multipathing software. There are also issues of virtualisation, product co-existence (e.g. mixing switches from the same vendor but different OEM vendors), management tools, replication software and applications.

The stack of products and levels that must be considered is huge, especially if each layer has multiple products. It is also true that certain vendors don't help this situation: (a) when new software is required to support a new array (for example), it affects all other layers of the stack; (b) competing vendors don't fully support their competitors' technology until well after the release date.

I've not yet seen a suitable solution for tracking certification levels. Firstly, even the vendors can't document their own products in a consistent fashion. Consider Emulex - querying the driver level on a Windows host returns a string which *contains* the version, but isn't in a consistent format. The returned string is different for Emulex on Solaris.

I think we've reached the point where we need a Certification Authority. Someone needs to take control and get all vendors to map their products to a consistent standard. Actually, I'd like to do it and I'm working on a markup language to help. It wouldn't be difficult. Probably the most troublesome part would be assigning an index code to each version/product/level that vendors produce - and getting them to use them...

By the way, in case anyone suggests SAN Advisor or Onaro SANScreen, I don't think either of these products are up to the mark. I've seen SAN Advisor demonstrated a number of times and it looked like a generation 1 product. SANScreen was interesting but not a certification tool as such - more like a semantic rather than syntactic storage product.