Thursday 22 March 2007

Uh Oh Domino

It seems that the world is moving to Exchange for email messaging. Unfortunately there are some of us still using Lotus Notes/Domino.

As a messaging product, it seems to me to be reasonably efficient; our Domino servers can support upwards of a thousand users, perhaps 1-2TB of Notes mailboxes. Domino stores the mailboxes as individual files with the .nsf extension. Each of these is opened and held by the nserver.exe task. When using Netbackup with the Notes/Domino agent, the Netbackup client backs up all nsf files on a full backup and the transaction logs and changed nsf files (i.e. those with a new DBID) for an incremental backup. This creates a significant amount of hassle when it comes to performing restores.
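The incremental selection rule described above can be pictured with a small sketch. This is only an illustration of the idea (pick the nsf files whose DBID is new or different since the previous backup), not how the Netbackup agent is actually implemented; the catalogue dictionaries, file names and DBID values are all made up.

```python
def select_for_incremental(previous, current):
    """Return the .nsf paths whose DBID is new or has changed.

    `previous` and `current` are hypothetical catalogues mapping an
    nsf path to the DBID seen at the last backup and seen now.
    """
    return sorted(
        path for path, dbid in current.items()
        if previous.get(path) != dbid
    )


# Hypothetical catalogues: one unchanged mailbox, one with a new DBID,
# and one mailbox created since the last backup.
previous = {"mail/asmith.nsf": "A1", "mail/bjones.nsf": "B1"}
current = {"mail/asmith.nsf": "A1", "mail/bjones.nsf": "B2", "mail/cnew.nsf": "C1"}

changed = select_for_incremental(previous, current)
print(changed)
```

Everything else that changed since the full backup is only captured in the transaction logs, which is why they matter so much at restore time.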

A restore is either the full nsf file from the last full backup, or the nsf file plus transaction logs, which are then applied to the nsf file to bring the mailbox up to date. This process is incredibly inefficient because (a) transaction logs contain data for all users and must be scanned for the records relating to the mailbox being restored; (b) the transaction logs need to be restored to a temporary file area, which could be considerable in size; and (c) the restored logs are discarded after the restore has completed and so have to be restored again for the next mailbox restore.
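To put rough numbers on points (b) and (c), here is a back-of-envelope sketch. The 20GB of logs since the last full backup and the five restores are invented figures for illustration only; the 1GB mailbox is simply a typical upper size.

```python
def restore_volume_gb(mailbox_gb, log_gb_since_full, restores):
    """Estimate data pulled back from backup for a number of mailbox restores.

    Assumes every restore needs the nsf file plus ALL transaction logs
    since the last full backup, and that the logs are thrown away after
    each restore, so nothing is reused between restores.
    """
    per_restore = mailbox_gb + log_gb_since_full
    return {
        "per_restore_gb": per_restore,
        "total_gb": per_restore * restores,
        "of_which_logs_gb": log_gb_since_full * restores,
    }


# Hypothetical case: 1GB mailboxes, 20GB of logs, five restores in a day.
estimate = restore_volume_gb(mailbox_gb=1, log_gb_since_full=20, restores=5)
print(estimate)
```

With these made-up figures, five restores move 105GB in order to recover 5GB of mailbox data, and the temporary area needs room for the full 20GB of logs each time.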

So, I’ve been looking at ways to bin Netbackup and improve the backup/restore process. As servers are being rebuilt on Windows 2003 Server, I’ve been looking at VSS (Volume Shadow Copy Service). This is a Windows feature which permits snapshots of file systems to be taken in co-operation with applications and the underlying storage. In this instance there isn’t a Lotus Domino writer for VSS, so any snapshots taken are dirty (however, I did find the dbcache flush command, which flushes updates and releases all nsf files).

Netapp used to have a product called SnapManager for Lotus Domino which enabled Netapp snapshots of mailboxes using the Domino Backup API. The product has been phased out, as tests performed by Netapp show that dirty snapshots, with the security of transaction logs, can be used to restore mailboxes successfully.

IBM provide trial versions of Domino, so I’ve downloaded and installed Domino onto one of my test servers under VMware and run the load simulator while taking snapshots with VSS. I’ve also successfully restored a mailbox from a snapshot, so there’s no doubting the process works. However, my simple test isn’t one of scale. Typical mailboxes are up to 1GB in size and there could be hundreds of active users on a system at any one time. My concern is whether VSS can manage to take snapshots with this level of activity (and not impact the O/S), and also whether the snapshots will be clean or, if not, what level of corruption we can expect.
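For anyone wanting to try the same thing, the flush-then-snapshot sequence could be scripted along these lines. Treat it as a sketch under assumptions rather than a record of my test setup: it assumes the console command can be passed to the running server with nserver.exe -c, that vssadmin create shadow is available on the server edition of Windows in use, and the install path and drive letter are placeholders. By default it only prints the commands.

```python
import subprocess


def build_snapshot_plan(domino_dir, volume):
    """Build the two commands: flush the Domino db cache, then snapshot.

    Assumptions: `nserver.exe -c "<command>"` passes a console command
    to the running server, and `vssadmin create shadow` exists on this
    Windows Server build. Both paths/arguments are placeholders.
    """
    flush = [domino_dir + "\\nserver.exe", "-c", "dbcache flush"]
    snapshot = ["vssadmin", "create", "shadow", "/for=" + volume]
    return [flush, snapshot]


def run_plan(plan, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical install directory and data volume.
plan = build_snapshot_plan(r"C:\Lotus\Domino", "D:")
run_plan(plan)  # dry run: prints the commands without executing them
```

Nothing here stops users from reopening databases between the flush and the snapshot, so the result may still be dirty; the gap between the two commands should simply be kept as short as possible.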

The only way to test this is to implement on a full scale Domino environment and probably with live users. That’s where things could get interesting!

2 comments:

Anil Gupta said...

Hello Chris,

Sorry for unrelated comment, couldn't find your email address on the blog.

I am attending the SNW in San Diego. I am contacting fellow storage bloggers to inquire about their plans for SNW. If you are attending SNW, would you be interested in attending a gathering of fellow storage bloggers? Any thoughts on any such gathering are most welcome.

I recently wrote a blog post on this topic
http://andirog.blogspot.com/2007/04/say-hello-at-snw.html. I will appreciate any help in spreading the word to fellow bloggers and storage professionals.

Thanks.

Anil

Unknown said...

Hi Chris

Interesting post indeed.

Can you tell me more about how you implemented the VSS/dbcache flush process? I am no Domino expert at all! So sorry for any daft ideas or concepts.

We back up a number of customer Domino databases using VSS and hit the problem of databases being in cache while the snapshot is taken.

When restored, these databases are then not able to be read:

"Invalid or non existent document"
"RRV Bucket is corrupt"
"Page format is incorrect"

So I am interested in how effective the dbcache flush process is, so that we can run it before the VSS snapshot is taken:

1) how long would the flushing typically take to happen?
2) what exactly then causes a database to be re-cached - is it simply the receipt of an email, or does the user have to open their client? In other words, how could we ensure databases are not in cache when the snapshot takes place?

As you are clearly an expert, I was wondering if you had any ideas for fixing the errors (above) that we see when databases are backed up in the crash state?

Would the tools like fixup and compact that I have read about typically be of any use to make the databases readable?

Many thanks for any help here.