Tuesday 2 September 2008

How Many IOPS?

A question I get asked occasionally, in relation to Enterprise-class arrays, is: "How many IOPS can my RAID group sustain?"

Obviously the first step is to determine the data profile; if it isn't known, then assume the I/O will be 100% random. If all the I/O is random, then each request requires a seek (moving the head to the right cylinder on the disk) plus rotational latency (waiting for the disk to spin to the start of the data to be read), which on average is half a rotation, or 2ms for a 15K RPM drive. Taking the latest Seagate Cheetah 15K Fibre Channel drives, every capacity in the range has the same average read seek time of 3.4ms. That gives a total service time of 5.4ms per I/O, or 185 IOPS (1000/5.4). The same calculation for a Seagate SATA drive gives a worst-case throughput of 104 IOPS, roughly half the throughput of the Fibre Channel drive.

For a RAID-5 3+1 group of Fibre Channel drives, data is striped across all four spindles, so the RAID group has a potential worst-case read throughput of 4 x 185 = 740 IOPS.
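The arithmetic above is easy to capture in a few lines of Python. This is just a back-of-the-envelope sketch using the figures quoted in this post (3.4ms seek, 15K RPM, a four-drive RAID-5 3+1 group), not a model of a real array:

    def worst_case_iops(seek_ms, rpm):
        # Worst-case random read: every I/O pays a full seek plus the
        # average rotational latency (half a revolution).
        rotational_latency_ms = (60000.0 / rpm) / 2
        service_time_ms = seek_ms + rotational_latency_ms
        return 1000.0 / service_time_ms

    fc_iops = worst_case_iops(seek_ms=3.4, rpm=15000)   # ~185 IOPS

    # RAID-5 3+1: reads are striped across all four spindles
    raid_group_iops = 4 * fc_iops                       # ~740 IOPS

    print("Per drive: %.0f IOPS, RAID group: %.0f IOPS"
          % (fc_iops, raid_group_iops))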

Clearly this is a rule of thumb: in practice, not every I/O will be completely random and incur the full seek/latency penalty. Enterprise arrays also have cache (as do the drives themselves) and plenty of clever algorithms to mask the limitations of the moving parts.

There are also plenty of other points of contention within the host->array stack, which makes the whole subject more complicated. However, when comparing drives of different speeds, calculating a worst-case scenario gives a good indication of how they will perform relative to each other.

Incidentally, as I just mentioned, the latest Seagate 15K drives (146GB, 300GB and 450GB) all have the same performance characteristics, so tiering based on drive size isn't that useful. The only exception is when high I/O throughput is required: with smaller drives, the same amount of data has to be spread across more spindles, increasing the available bandwidth (see the sketch below). That's why I think tiering should be done on drive speed, not size...
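To put some numbers on the spindle-count point, here is a hypothetical sizing exercise. The 10TB capacity requirement is purely illustrative, and the per-drive figure is the worst-case number calculated above:

    import math

    PER_DRIVE_IOPS = 185   # worst-case figure calculated above

    def spindles_needed(capacity_gb, drive_gb):
        # Drives needed to hold the capacity (ignoring RAID and formatting overhead)
        return int(math.ceil(float(capacity_gb) / drive_gb))

    # The same 10TB requirement laid out on each drive size in the range
    for drive_gb in (146, 300, 450):
        n = spindles_needed(10000, drive_gb)
        print("%dGB drives: %d spindles, ~%d worst-case IOPS"
              % (drive_gb, n, n * PER_DRIVE_IOPS))

The smallest drives give three times the aggregate worst-case IOPS of the largest, purely because the capacity forces more spindles under the data.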

6 comments:

BarryWhyte said...

Chris, totally agreed. It's not a capacity statement, but a performance statement. 15K, 10K, 7.2K, and then SATA!

It does, however, make a difference now that Samsung are offering 32MB cache models - OK, they're SATA desktop drives, but the difference between a 'Spinpoint' with the larger cache and even a 10K RPM SATA Raptor is amazing!

Provisioning needs to take into account drive speed, RAID type and (in an SVC-type model) controller type. The difference between one vendor's RAID-x and another's is quite amazing. I'd be in trouble if I quoted names... but think about who was last to market to bring pseudo RAID-5...

That's where a 'pooling' strategy helps. Define your gold, silver and bronze (maybe platinum too these days with flash) and then provision from a pool. The key thing here is a 'price to your internal customers': a SATA drive maybe only makes a minimal difference to your overall costs if you put it in a big monolith - say a DMX. But if you buy a SATA-based controller, like a DS4200, then the overall cost for 1TB is *way* lower. Being able to pass that saving on to your customers has a benefit, not only for the bronze users but for the platinum ones too.

A system that can provide just such a varied cost point would be nice, wouldn't it... guess what, that's where SVC comes in :)

rwhiffen on gmail said...

Although the drives may have the same performance characteristics, the larger drives hold more data, and therefore have more tickets in the random I/O lottery - a bigger drive can have its number called to the I/O dance floor more often than a smaller drive right next to it. I usually treat size as a secondary consideration when making storage decisions because, as you said, cache and OS components can mask things. The drive size probably won't change 'if' the performance is going to hockey stick, but it might change 'when'. If I'm in a situation where 146GB vs 300GB is going to make or break my performance, I've done something seriously wrong.

Great post... keep it up.

Unknown said...

"That's why I think tiering should be done on drive speed not size..."

Errrr.... not exactly. In my less than humble opinion, the data itself should determine the tiering based on the performance required.

-James Orlean
StorageMonkeys.com

Chris M Evans said...

Guys

Thanks for the comments. Barry, I agree with your comments on service. I don't like the terms tier 1/2/3; your medal comparison suits better - and what a great lead-in to advertise SVC! :-)

James, I guess what I was meaning was that I don't see enough of a differentiator between 146GB and 300GB 15K drives to call them separate tiers. Performance is the usual tier 1/2 consideration, and drives with identical performance can't be used to separate tiers on that basis.

rwhiffen, thanks for the words of support!

Blaese said...

Enjoyed reading your little review of IOPS performance and it sparked a couple of thoughts that you might find interesting - posted on my blog at http://www.infrageeks.com/groups/infrageeks/weblog/bab55/How_Many_IOPS_.html

I do a lot of work in storage sizing for VMware deployments and it remains one of the fuzziest parts of the project. Anything we can do to normalize this is a step in the right direction.

Cheers!

Rob said...

"How many IOPS can your [Enterprise] RAID group sustain?"

More than you posted, depending on workload and other things. But in general, more.

A quick google reveals things like this:

270 IOPS per FC drive

Tagged command queuing allows you to push them that hard. Of course, response time tails off, and it would be punishing to drive them that hard if interactive use were the norm.
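A rough way to see this trade-off is Little's law: throughput = outstanding I/Os / response time. In the sketch below, only the queue-depth-1 service time (5.4ms) comes from the post; the 30ms response time at queue depth 8 is purely illustrative:

    def throughput_iops(queue_depth, response_time_ms):
        # Little's law: throughput = concurrency / latency
        return queue_depth * 1000.0 / response_time_ms

    # Queue depth 1: each I/O waits for the previous one, ~5.4ms each
    print(throughput_iops(1, 5.4))    # ~185 IOPS

    # Queue depth 8 (illustrative): the drive can sort its seeks, so
    # throughput rises even though each I/O now waits far longer
    print(throughput_iops(8, 30.0))   # ~267 IOPS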

Secondly, as mentioned, what about usage? What if the DBAs do a normal morning import of data and the array gets hammered with large I/Os for a time? Writes and transfer time for the larger I/Os come into play, so maybe the 185 IOPS you are quoting is too generous.

"With smaller drives, data has to be spread across more spindles, increasing the available bandwidth."

Not sure drive size comes into play unless your "Enterprise array" is feature-poor. Barry suggests you can front-end with SVC. You can carve and migrate to meta-LUNs touching more RAID groups in a Clariion. You could carve your LUNs out of a 100-disk pool in an EVA. Or (duck) try an XIV and furgetaboutit. The point not to be overlooked is that LUN management can overcome I/O issues, and looking ahead a number of vendors will be working with 1MB "partitions" and side-stepping LUN management pain.