Tuesday 17 July 2007

Performance Part V

Here's the last of the performance measurements for now.

Logical Disk Performance - monitoring of LDEVs. There are three main groups Tuning Manager can monitor; IOPS, throughput (transfer) and response time. The first two are specific to particular environments and the levels for those should be set to local array performance based on historical measurement over a couple of weeks. Normal "acceptable" throughput could be anything from 1-20MB/s or 100-1000 IOPS. It will be necessary to record average responses over time and use these to set preliminary alert figures. What will be more important is response time. I would expect reads and writes to 15K drives in a USP to perform at 5-10ms maximum (on average) and for 10K drives to perform up to 15ms maximum. Obviously synchronous write response will have a dependency on the latency of writing to the remote array and that overhead should be added to the above figures. Write responses will also be skewed by block size and number of IOPS

Reporting every bad LDEV I/O response could generate a serious number of alerts, especially if tens of thousands of IOPS are going through a busy array. It is sensible to set reporting alerts high and reduce them over time until alerts are generated. These can then be investigated (resolved as required) and the thresholds reduced further. LDEV monitoring can also benefit from using Damping. This option on an Alert Definition allows an alert to be generated only if a specific number of occurrences of an alert are received within a number of monitoring intervals. So, for instance, an LDEV alert could be created when 2 alert occurrences are received within 5 intervals. Personally I like the idea of Damping as I've seen plenty of host IOSTAT collections where a single bad I/O (or handful of bad I/Os) are highlighted as a problem when 1000s of good fast IOPS are going through the same host.

This is the last performance post for now. I'm going to do some work looking at the agent commands for Tuning Manager, which as has been pointed out here previously, can provide more granular data and alerting (actually I don't think I should have to run commands on an agent host, I think it should all be part of the server product itself, but that's another story).

1 comment:

Stephen said...

Chris,

Excellent reading for all the performance postings. Have you had a lot of success with setting thresholds with HTnM alerting? I either get millions or none at all. It seems a very fine line to get it right so hopefully the upcoming HDS webinar on HTnM will offer some guidance.

Stephen