Thursday 22 March 2007

Uh Oh Domino

It seems that the world is moving to Exchange for email messaging. Unfortunately there are some of us still using Lotus Notes/Domino.

As a messaging product, it seems to me to be reasonably efficient; our Domino servers can support upwards of a thousand users, perhaps 1-2TB of Notes mailboxes. Domino stores the mailboxes as individual files with the .nsf extension, each of which is opened and held by the nserver.exe task. When using NetBackup with the Notes/Domino agent, the NetBackup client backs up all nsf files on a full backup, and the transaction logs plus any changed nsf files (i.e. those with a new DBID) on an incremental backup. This arrangement creates a significant amount of hassle when it comes to performing restores.

A restore is either the full nsf file from the last full backup, or the nsf file plus transaction logs, which are then applied to the nsf file to bring the mailbox up to date. This process is incredibly inefficient because: (a) transaction logs contain data for all users and must be scanned for the records relating to the mailbox being restored; (b) the transaction logs need to be restored to a temporary file area, which could require considerable space; (c) the restored logs are discarded once the restore has completed and so have to be restored all over again for the next mailbox restore.

So, I’ve been looking at ways to bin NetBackup and improve the backup/restore process. As servers are being rebuilt on Windows 2003 Server, I’ve been looking at VSS (Volume Shadow Copy Service). This is a Windows feature which permits snapshots of file systems to be taken in co-operation with applications and the underlying storage. In this instance there isn't a Lotus Domino provider, so any snapshots taken are dirty (however I did find the dbcache flush command, which flushes updates and releases all nsf files). Netapp used to have a product called SnapManager for Lotus Domino which enabled Netapp snapshots of mailboxes using the Domino Backup API. The product has been phased out, as tests performed by Netapp showed that dirty snapshots, with the safety net of the transaction logs, can be used to restore mailboxes successfully.

IBM provide trial versions of Domino, so I've downloaded and installed Domino onto one of my test servers under VMware and run the load simulator while taking snapshots with VSS. I've also successfully restored a mailbox from a snapshot, so there's no doubting the process works. However, my simple test isn't one of scale. Typical mailboxes are up to 1GB in size and there could be hundreds of active users on a system at any one time. My concern is whether VSS can take snapshots under this level of activity without impacting the O/S, and whether the snapshots will be clean or, if not, what level of corruption we can expect.
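To give an idea of the mechanics, here's a minimal sketch of a snapshot wrapper. It assumes the Domino data directory sits on drive D:, that a console command can be passed to the running server via nserver -c, and that vssadmin create shadow is available (it is on Windows Server 2003). With no Domino VSS writer, the result is still a crash-consistent snapshot at best.

```python
# Minimal sketch: quiesce Domino as far as possible, then take a VSS snapshot.
# Assumptions: Domino data on D:, nserver.exe on the PATH, vssadmin available.
import subprocess

def run(cmd):
    """Run a command, echo it for the log and stop if it fails."""
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Ask Domino to flush updates and release open .nsf files before the snapshot
# (the dbcache flush console command mentioned above).
run(["nserver", "-c", "dbcache flush"])

# Take a shadow copy of the volume holding the mail files. Without a Domino
# VSS writer this is still a "dirty" (crash-consistent) snapshot.
run(["vssadmin", "create", "shadow", "/for=D:"])
```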

The only way to test this is to implement on a full scale Domino environment and probably with live users. That’s where things could get interesting!

Friday 16 March 2007

WWN Decoder

As pointed out by Richard, my WWN decoder stopped working when I redid my website. Here's a new link: http://www.brookend.com/html/resources/wwndecoder.asp.

Developing a Tiering Strategy

Implementing a storage tiering strategy is a big thing these days. Everyone should do it. If you don't, then you're not a "proper" storage administrator. Being serious and moving away from the hype for a second, there is a lot of sense in implementing tiering. It comes down to one thing - cost. If disk and tape storage were free, we'd place all our data on the fastest media. Unfortunately storage isn't free, and therefore matching data value to storage tiers is an effective way of saving money.

Choosing the Metrics

In order to create tiers it's necessary to set the metrics that define different tiers of storage. There are many to choose from:


  • Response time
  • Throughput
  • Availability (e.g. 5 9's)
  • Disk Geometry (73/146/300/500GB)
  • Disk interconnection (SATA/FC/SCSI)
  • Usage profile (Serial/Random)
  • Access Profile (24x7, infrequent)
  • Data value
  • Array Type (modular/enterprise)
  • Protection (RAID levels)

There are plenty more, but these give you a flavour of what could be selected. In reality, to determine which metrics to use, you need to look at what would act as a differentiator in your environment. For example, would it really be necessary to use 15K drives rather than 10K? Is availability important - should RAID6 be considered over RAID5? Is there data in the organisation that would exist happily on SATA drives rather than fibre channel? Choosing the metrics is a difficult call to make, as it relies on knowing your environment to a high degree.

There are also a number of other options to consider. Tiers may be used to differentiate functionality - for example, to specify whether remote replication or point-in-time copies are permitted.
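One way of capturing the outcome is a simple tier catalogue recording the chosen metrics and functional entitlements per tier. The tier names, drive types, RAID levels and features below are purely illustrative:

```python
# Illustrative tier catalogue: the names, metrics and values are invented,
# but the idea is to record the differentiators chosen for each tier.
TIERS = {
    "tier1": {"drive": "FC 15K", "raid": "RAID1", "replication": True,  "pit_copies": True},
    "tier2": {"drive": "FC 10K", "raid": "RAID5", "replication": False, "pit_copies": True},
    "tier3": {"drive": "SATA",   "raid": "RAID6", "replication": False, "pit_copies": False},
}

def allowed(tier, feature):
    """Check whether a functional option (e.g. replication) is offered on a tier."""
    return TIERS[tier].get(feature, False)

print(allowed("tier2", "replication"))   # False - replication is a tier 1 feature here
```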

Is It Worth It?

Once you've outlined the tiers to implement, you have to ask a simple question: will people actually use the storage tiers you've chosen? Tiering only works if you can retain a high usage percentage of the storage you deploy - it's no use deploying 20TB of one tier and only using 10% of it. This is a key factor: there will be a minimum footprint and capacity which must be purchased for each tier, and unless you can guarantee that storage will be used, any saving from tiering may be negated by unused resources. Narrow your tiering choices down to those you think are actually practical to implement.
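A quick back-of-the-envelope calculation makes the point. The prices and utilisation figures below are made up, but they show how a nominally cheaper tier can cost more per gigabyte actually used:

```python
# Does a cheaper tier still save money once you allow for the minimum purchase
# and the utilisation you expect to achieve? All figures are illustrative.

def effective_cost_per_gb(price_per_gb, utilisation):
    """Cost per *used* GB, given the fraction of deployed capacity in use."""
    return price_per_gb / utilisation

tier1 = effective_cost_per_gb(price_per_gb=20.0, utilisation=0.85)  # FC, well used
tier3 = effective_cost_per_gb(price_per_gb=5.0,  utilisation=0.10)  # SATA, barely used

print(f"Tier 1 effective cost/GB: {tier1:.2f}")  # 23.53
print(f"Tier 3 effective cost/GB: {tier3:.2f}")  # 50.00 - the 'cheap' tier costs more
```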

Making the Move

So, the tiers are set, storage has been evaluated and migration targets have been identified. How do you make it worthwhile for your customers to migrate? Again, it comes back to cost. Tiers of storage will attract differing costs for the customer, and calculating and identifying the cost savings will provide the justification for investing in the migration. In addition, tiers can be introduced as part of a standard technology refresh - a process that happens regularly anyway.
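As a sketch of that justification, the saving is simply the chargeback difference across the migrated capacity, set against the one-off cost of doing the move. Every figure below is hypothetical:

```python
# Hypothetical migration business case: annual saving from re-tiering a chunk
# of data versus the one-off cost of performing the migration.
migrate_gb = 5_000
cost_per_gb_year = {"tier1": 20.0, "tier3": 5.0}   # made-up chargeback rates
migration_cost = 25_000                            # made-up one-off effort cost

annual_saving = migrate_gb * (cost_per_gb_year["tier1"] - cost_per_gb_year["tier3"])
payback_years = migration_cost / annual_saving

print(f"Annual saving: {annual_saving:,.0f}; payback in {payback_years:.2f} years")
# Annual saving: 75,000; payback in 0.33 years
```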

Gotcha!

There are always going to be pitfalls with implementing tiering:

  1. Don't get left with unusable resources. It may be appealing to identify lots of storage that can be pushed to a lower tier. However, if the existing tier of storage is not end-of-life, or unless you have customers for it, you could end up with a lot of unused high-tier storage, which reflects badly on your efficiency targets. Make sure new storage brought in for tiering doesn't impact your overall storage usage efficiency.
  2. Avoid implementing technology-specific tiers which may change over time. One example: it is popular to tier by drive size on the assumption that higher-capacity drives offer lower performance and are therefore matched to a lower tier. But what happens when the predominant drive type changes, or you buy a new array in which the larger drives perform just as well as those in the older array? How should those tiers be classified?
  3. Be careful when choosing absolute parameters for tiers. For example, it is tempting to quote response time figures in tier definitions. However, no subsystem can guarantee consistent response times, so it may be more appropriate to set confidence limits - committing to response times below a given threshold for a stated percentage of I/Os, rather than a single absolute number (see the sketch below).
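Here's a minimal illustration of reporting against that kind of confidence limit. The 10ms threshold, 95% confidence level and the simulated samples are all invented figures:

```python
# Express a tier's response-time target as a confidence limit rather than an
# absolute number, and check measured samples against it. Figures are made up.
import random

def percentile(samples, pct):
    """Return the value below which pct% of the samples fall."""
    ordered = sorted(samples)
    index = int(round((pct / 100.0) * (len(ordered) - 1)))
    return ordered[index]

# Pretend these are measured I/O response times in milliseconds.
samples = [random.lognormvariate(1.5, 0.5) for _ in range(10_000)]

target_ms, confidence = 10.0, 95
p95 = percentile(samples, confidence)
print(f"95th percentile = {p95:.1f} ms; "
      f"{'meets' if p95 <= target_ms else 'misses'} the <{target_ms} ms @ {confidence}% target")
```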

Iterative Process

Developing a tiering strategy is an iterative process which will be constantly refined over time. There's no doubt that, implemented correctly, it will save money. Just don't implement it and then forget about it.

Wednesday 14 March 2007

Standards for Measuring Performance

I had a bit of banter following a post I made on ITToolbox earlier in the week. The question was posed as to why the major disk array vendors (other than Netapp) don't produce performance statistics. There are benchmarks - the Standard Performance Evaluation Corporation (SPEC) and the Storage Performance Council (SPC), for instance. However, SPEC only covers NAS, and neither HDS nor EMC sign up to SPC.

Producing a consistent testing standard is extremely difficult. Each storage array vendor has designed their hardware to have unique selling points and, more importantly, no vendor can make their systems too close to their rivals', otherwise there'd be more money made by lawyers than by the vendors themselves.

Let's pick a few examples. In terms of architecture, DMX (EMC) and USP (HDS) have a similar design. They both have front-end channel adaptor cards, central cache and back-end loops on which (probably identical) hard disk drives are connected. However the way in which the physical storage is carved up is totally different.

HDS uses the concept of an array or parity group; a 6+2 RAID group has 8 disks of the same size which can be carved up into logical units (LUNs). The address at which these LUNs are mapped is up to the user, but typically LUNs will be dispersed across multiple array groups to ensure that consecutive LUNs are not mapped to the same physical disks. This process ensures that data is hitting as many spindles as possible.
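A rough sketch of that dispersal idea: map consecutive LUNs round-robin across parity groups so that neighbouring LUNs don't land on the same physical spindles. The group and LUN counts are arbitrary example values.

```python
# Spread consecutive LUNs across parity groups so sequential LUNs hit
# different physical spindles. Counts are illustrative only.
parity_groups = [f"PG-{n}" for n in range(1, 5)]   # four 6+2 array groups
lun_count = 12

layout = {lun: parity_groups[lun % len(parity_groups)] for lun in range(lun_count)}

for lun, group in layout.items():
    print(f"LUN {lun:02d} -> {group}")
# LUN 00 -> PG-1, LUN 01 -> PG-2, ... so consecutive LUNs use different spindles.
```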

EMC chooses another method. Each physical drive is divided into hypers, or logical slices. These are then recombined to make LUNs. RAID 1 LUNs have 2 hypers, RAID5 LUNs have 4 hypers. Each hyper is taken from a different back-end drive loop to improve performance and resiliency.
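And an equally rough sketch of the hyper approach: slice each physical drive into hypers, then recombine hypers taken from drives on different back-end loops into a LUN. Loop names and drive counts are invented for illustration.

```python
# Build a LUN from hypers sourced from drives on different back-end loops.
# Loop names, drive counts and the 4-hyper RAID5 layout follow the text above.
drives = {loop: [f"{loop}-disk{d}" for d in range(4)]
          for loop in ("loopA", "loopB", "loopC", "loopD")}

def build_raid5_lun(loops, member=0):
    """Take one hyper from a drive on each of four different back-end loops."""
    return [f"hyper({drives[loop][member]})" for loop in loops]

lun_0 = build_raid5_lun(["loopA", "loopB", "loopC", "loopD"])
print(lun_0)   # four hypers, each sourced from a different back-end loop
```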

Now, the comments earlier in the week referred to Netapp. Their architecture comes from a NAS file serving design, which uses RAID 4 and has all the operations for RAID calculations handled in memory. LUNs are carved out of RAID "volumes" or aggregates.

So what is a fair configuration to use? Should it be the same number of spindles? How much cache? How many backend loops and how many front-end ports? Each vendor can make their equipment run at peak performance but choosing a "standard" configuration which sets a level playing field for all vendors is near impossible. Not only that, some manufacturers may claim their equipment scales better with more servers. How would that be included and tested for?

Perhaps, rather than doing direct comparisons, vendors should submit standard-style configurations, based on a set GB capacity carved into preset LUN sizes, against which testing is performed using common I/O profiles. This, together with list price, would let us make up our own minds.
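A hypothetical submission along those lines might look like this: a fixed capacity, preset LUN sizes and a handful of common I/O profiles, reported alongside list price so readers can work out their own price/performance. Every figure and name below is invented.

```python
# Hypothetical "standard submission" format with invented figures.
standard_config = {
    "usable_capacity_gb": 10_000,
    "lun_size_gb": 50,
    "io_profiles": [
        {"name": "OLTP",      "read_pct": 70, "block_kb": 8,   "pattern": "random"},
        {"name": "Backup",    "read_pct": 10, "block_kb": 256, "pattern": "sequential"},
        {"name": "Fileserve", "read_pct": 80, "block_kb": 64,  "pattern": "mixed"},
    ],
}

# A vendor submission would then report measured IOPS per profile and list price.
submission = {"vendor": "ExampleCo", "list_price": 250_000,
              "results_iops": {"OLTP": 40_000, "Backup": 9_000, "Fileserve": 22_000}}

for profile, iops in submission["results_iops"].items():
    print(f"{profile}: ${submission['list_price'] / iops:.2f} per IOPS at list price")
```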

Either way, any degree of testing won't and shouldn't stop healthy discussion!

Tuesday 6 March 2007

Software Shortcomings


I was invited last week to speak at the first of (hopefully) many HDS User Forums in the UK. The subject was Storage Resource Management, and I talked on the subject and gave my thoughts for about 30 minutes. One slide generated the most interest and I've included it here. It shows some of the issues current SRM products have which are not really being addressed by software vendors.

A case in point is the installation of Device Manager I performed this week on a virtual domain. HiCommand Device Manager 5.1 is supported on VMware (2.5); however, I couldn't get the software to install at all. I tried the previous version, which worked fine, so I was confident the Windows 2003 build was OK. HDS pointed me at a feature I'd not seen before, Data Execution Prevention (DEP), which is intended to prevent certain types of virus attack based on buffer overflows. Whilst this solved the immediate problem, it didn't fill me with a great deal of confidence that Windows was judging HDS's software to behave like a virus. Even with the DEP settings adjusted, the installation got further but still eventually failed; on HDS' advice, re-running the installation worked.

At the forum, HDS presented their SRM roadmap. If it all comes to fruition then I'll be able to do my provisioning, monitoring and other storage management tasks from my Blackberry whilst sipping Piña Coladas on a Caribbean beach. Back in the real world, my concern is that if the existing tools don't even install cleanly, how am I expected to trust a tool which is moving my disks around dynamically in the background?

It's easy for me to target HDS in this instance but all vendors are equally culpable. I think there's a need for vendors to "walk before they can run". Personally, I'd have more trust in a software tool that was 100% reliable than one which offered me lots of new whizzy features. Vendors, get the basics sorted. Get confidence in your tools and build on that. That way I might get to do provisioning from the beach before I'm too old and grey to enjoy it.

Monday 5 March 2007

Backing Up Branches

I've spent the last few weeks heavily involved in a storage performance issue. Unfortunately I can't discuss the details as they're too sensitive (in fact, I can't even mention the TLA vendor in question), however it did make me think more about how we validate that a storage architecture design will actually work.

As part of another piece of work I'm doing, I have been looking at getting branch or remote office data back into the core datacentre infrastructure, the premise being to avoid writing tapes in remote locations and so avoid the risk of compromising personal information.

I've looked at a number of solutions, including the Symantec/Veritas PureDisk option (the flashy Flash demo can be found here). What strikes me with this and other solutions is the lack of focus on data restores and on meeting the RTOs and RPOs of applications that may run in branch locations. In fact, I think most products are aimed at the remote office which runs a small Windows implementation and therefore isn't worried too much about the recovery time of its data.

Technologies such as PureDisk work by reducing the data flowing into the core datacentres, through de-duplication, compression or both. This is fine if (a) your data profile compresses the way they expect, and (b) your applications/data produce a small percentage of changed data on a daily basis. Unfortunately, for some applications this just doesn't hold. In addition, if you lose the site (or, even more easily, corrupt or lose the local metadata database) then you're looking at a total restore from the core to recover an entire server. These solutions work on the basis that branch offices have low WAN bandwidth, and the backup tools make that small pipe appear much larger through the aforementioned de-dupe and compression. This doesn't help if a full restore needs to be performed.
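A quick sanity check on the arithmetic shows why. De-dupe helps the nightly incremental, but a bare-metal restore still has to pull the whole data set back down the pipe. All figures below (server size, change rate, link speed) are illustrative only:

```python
# Compare nightly backup time with full-restore time over a branch WAN link.
def transfer_hours(data_gb, link_mbps, efficiency=0.7):
    """Hours to move data_gb over a link of link_mbps at a given efficiency."""
    bits = data_gb * 8 * 1024**3                      # GB (binary) to bits
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600

server_gb = 200          # size of the branch server to restore
daily_change_gb = 2      # what de-dupe/compression actually sends each night

print(f"Nightly backup: {transfer_hours(daily_change_gb, 2):.1f} h over a 2 Mbit/s link")
print(f"Full restore: {transfer_hours(server_gb, 2):.0f} h over the same link")  # ~14 days
```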

Some people may say "but you could manage the restore by sending a tape containing the last backup to the site" - however, that defeats the original objective of keeping tapes out of the branch.

I'd like to see some real-world examples from Symantec/Veritas showing how long backups and restores would take with their product. Of course, they're not going to provide that, as it would immediately highlight the issues involved. Perhaps we need an independent backup testing lab to run that comparison across all the backup products on the market.