Tuesday 26 June 2007

Using Tuning Manager for something useful - Part I

Stephen commented about Tuning Manager and doing something useful with the data. I thought I would use this as an opportunity to highlight some of the things I look at on a regular basis (almost daily, including today in fact). Part I - Write Pending.

First, a little bit of background: Write Pending is a measure of the write I/Os sitting in cache that have not yet been destaged to disk. A USP/9980V (like most other arrays) stores write I/Os in cache and acknowledges successful completion to the host immediately. The writes are then destaged to disk asynchronously some time later.

Caching writes (a technique which, and I could be wrong here, may have been introduced by IBM as Cache Fast Write or DASD Fast Write - I can't remember which was which) gives the host a consistent response and removes the mechanical delay of writing to disk. Most workloads are a mixture of reads and writes, so the array can smooth out the write I/O and provide a more consistent I/O response. I see consistent I/O response as one of the major benefits of an enterprise array.

Tuning Manager collects Write Pending (expressed as the percentage of cache being used for write I/O) in two ways. You can use Performance Reporter to collect real-time data from the agent/command device talking to the array; this can be plotted on a nice graph down to 10-second intervals. You can also collect historical WP data, which is retrieved from the agent on an hourly basis and provides an average and a maximum figure. Hourly stats are retained for a maximum of 99 periods (just over 4 days) and can be aggregated into daily, weekly and monthly figures.
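To make the roll-up a little more concrete, here is a rough Python sketch of how hourly average/maximum samples might be aggregated into daily figures. It works against a hypothetical CSV export with timestamp, avg_wp_pct and max_wp_pct columns (the file name and column names are my own invention, not a Tuning Manager format):

    # Sketch: roll hourly Write Pending samples up into daily figures.
    # "wp_hourly.csv" and its columns are illustrative only - adapt to your own export.
    import csv
    from collections import defaultdict
    from datetime import datetime

    daily = defaultdict(lambda: {"avgs": [], "max": 0.0})

    with open("wp_hourly.csv") as f:
        for row in csv.DictReader(f):
            day = datetime.fromisoformat(row["timestamp"]).date()
            daily[day]["avgs"].append(float(row["avg_wp_pct"]))
            daily[day]["max"] = max(daily[day]["max"], float(row["max_wp_pct"]))

    for day, d in sorted(daily.items()):
        avg = sum(d["avgs"]) / len(d["avgs"])
        print(f"{day}  avg WP {avg:5.1f}%  max WP {d['max']:5.1f}%")

The same approach extends to weekly or monthly figures by changing the grouping key.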

A rising and falling WP figure is not a problem; it is to be expected in the normal operation of an array. However, WP can reach critical levels at which a USP performs more aggressive destaging to disk; these figures are 30% and 40%. Once WP reaches 70%, inflow control takes over and starts to elongate response times to host write requests in order to limit the amount of data coming into the array, which obviously has a major impact on performance. I would look to put monitoring in place within Tuning Manager to alert when WP reaches these figures. Why would an array see high WP levels? Clearly the causes are all write-related, but they could include heavy backup activity, database loads, file transfers and so on. It could also be an issue with data layout.
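For illustration, here is a minimal sketch of the alerting logic I'd want around those figures. The 30/40/70 thresholds come from the paragraph above, but how you obtain the WP value (real-time agent data, hourly history, an export) is left entirely open - this is not Tuning Manager code:

    # Sketch of alerting logic for the 30/40/70 Write Pending thresholds.
    # How wp_pct is obtained is deliberately left out - this is not a Tuning Manager API.
    THRESHOLDS = [
        (70.0, "CRITICAL - inflow control; host write response times will be elongated"),
        (40.0, "WARNING  - aggressive destaging in progress"),
        (30.0, "NOTICE   - destaging becoming more aggressive"),
    ]

    def classify_wp(wp_pct):
        """Return an alert message for the highest threshold crossed, or None."""
        for limit, message in THRESHOLDS:
            if wp_pct >= limit:
                return f"WP {wp_pct:.1f}% >= {limit:.0f}%: {message}"
        return None

    if __name__ == "__main__":
        for sample in (12.0, 31.5, 44.0, 72.3):
            print(classify_wp(sample) or f"WP {sample:.1f}% - normal")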

If an array can't cope with WP reaching 30%+ then some tuning may be required: consider distributing workload over a wider time period (all backups may have been set up to start at the same time, for instance) and investigate how the data receiving the write activity is laid out; distributing the writes over more LUNs and array groups could be a remedy. You may also need to choose faster disks for certain array groups and/or purchase additional cache.

I think Tuning Manager could do with a few improvements in how Write Pending data is collected. Firstly, unless I'm running Performance Reporter, I have no idea at what point within an hourly period the maximum WP was reached or how long it stayed at that level. Some kind of indicator would be useful - an hour is a long time to trawl through host data.

I'd also like to know, when the maximum WP occurs, which LUNs had the most write I/O activity. LDEV activity is also summarised on the hour, so it's not easy to see which LUNs within an hourly interval were causing the WP issue. As an example, an LDEV could have been extremely active for 10 minutes and caused the problem but, on average, could have done much less I/O than another LDEV which was consistently busy (at a low I/O rate) for the whole one-hour period. Averaging over an hour has a tendency to mask out the peaks.
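A quick made-up example of that masking effect (the IOPS numbers are purely illustrative):

    # LDEV A: 3,000 IOPS for 10 minutes, then idle - the burst that pushed WP up.
    # LDEV B: 600 IOPS steadily for the whole hour.
    burst_avg  = (3000 * 10) / 60   # 500 IOPS averaged over the hour
    steady_avg = (600 * 60) / 60    # 600 IOPS averaged over the hour

    print(f"bursty LDEV hourly average: {burst_avg:.0f} IOPS")
    print(f"steady LDEV hourly average: {steady_avg:.0f} IOPS")
    # The hourly report ranks the steady LDEV as the busier one,
    # even though the bursty LDEV is the likely cause of the WP spike.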

Finally, I'd like to be able to have triggers which can start more granular collection. So when a critical WP issue occurs, I want to collect 1-minute I/O stats for the next hour to an external file. I think it is possible to trigger a script from an alert, but I'm not clear whether that trigger fires at the point the max WP value is reached or only when the interval collection occurs on the hour (by which time I could be 59 minutes too late).
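Something along these lines is the kind of triggered script I have in mind - a sketch only, with poll_wp() as a hypothetical placeholder for whatever agent command or export actually provides the data:

    # Sketch of a script fired by an alert: collect one sample per minute
    # for the next hour and append to an external file.
    import time
    from datetime import datetime

    def poll_wp():
        """Placeholder - replace with a real call to the storage agent or an export."""
        raise NotImplementedError("hook this up to your own data source")

    def collect_granular(outfile="wp_granular.log", minutes=60):
        with open(outfile, "a") as f:
            for _ in range(minutes):
                f.write(f"{datetime.now().isoformat()} WP={poll_wp():.1f}%\n")
                f.flush()
                time.sleep(60)

    if __name__ == "__main__":
        collect_granular()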

Next time, I'll cover Side-file usage for those who run TrueCopy async.

9 comments:

Rick@HDS said...

Chris:

If you set up an alert through the master console, then it does only check at the next polling cycle (default - 1 hour). However, you can create an alert on the storage system agent itself, which would fire within a minute.

BTW, HDS is starting up bi-weekly "Power Hour" webinars. The first one, July 25th at 9am PDT, is on how to create agent-level alarms in Tuning Manager. Announcements will be going out in the next couple of weeks. Info should be posted on our web site and on the HDS user forums.

Stephen said...

Chris,

Excellent background reading. Our WP falls nicely within the limits you suggested.

I rummaged through HTnM Performance Reporter (PR) and it seems that Write Pending Rate and Max Write Pending Rate are not there, unless I am blind.

They appear in the Resources -> Storage reporting window. Perhaps they go by a different name in PR?

On a side note, I thought I would check out "Subsystem Cache Memory Usage Status" and "Array Group Write Cache Hit Rate". Both present zero (0). I have a funny feeling that HTnM has an issue with cache on the USPs, as they never report anything other than zero.

The Resources -> Storage reporting window always shows 0.0% cache usage on the USP, but the 9980V has non-zero information, normally around 99%. Do you know of a bug with the USP and HTnM?

Stephen

Stephen said...

Rick,

What about all us people in the Asia area? That time is well past midnight in Australia/Japan/etc.

If you run it at 4PM, that probably would suit a more global audience.

Stephen

Chris M Evans said...

Stephen, if you look at Performance Reporter, there are "Troubleshooting" reports; you can look at the recent history report and cache usage is in there.

Stephen said...

I'm a moron. I did not expand the page out enough to see the report. I generally tend to spend more time with HTnM on AG issues.

I found it in "Subsystem Cache Memory Usage Details(6.0)".

I still get Cache Memory Usage of 0. Do you see something different on your USPs?

Stephen

Chris M Evans said...

Rick, I'll have a look at the agent-based alerts; however, my view is that if you create a product which provides centralised reporting, then all features should be available in the main product. It shouldn't be necessary to log on to the agent server to do this work - especially if the agents are in different datacentres with multiple access/security requirements. I look forward to seeing the webinar announcements.

Stephen said...

When I did my training on the USP, the instructor gave me some thresholds which I just rediscovered. I am still waiting on the official word from HDS. I thought you and anyone else who reads this blog might find them interesting.

CHIPs < 70%
ACPs < 70%
WP rate < 30% (interesting)
PGs < 50%
LDEVs < 50%
Cache utilisation > 90%

I wonder whether they are gospel or not?

Stephen

Chris M Evans said...

Stephen

My understanding is that the WP rate is definitely a fixed figure, as is the 70% limit at which inflow control takes over. I'm not sure about the others you list. There are some I know are configurable, such as the async TrueCopy limits. As promised, over the coming days I'll provide more detail on the figures I know about and would monitor.

Unknown said...

FYI, information on the technical webcasts I mentioned is now available at www.hds.com/webtech. I understand the timing is convenient for the U.S. market; however, the webcasts will be recorded and should be available for download following each session.