Saturday 7 April 2007

Distributed Backup

Following on from a previous post on RAID and backup, I've been doing some more thinking on how to back up consumer data from a PC workstation. I reckon I've got over 200GB of data on my server, which was previously on my main workstation. I dabbled with the Linksys NSLU2, but I hated it. I was really nervous that (a) I would lose access to the device, (b) it couldn't cope with the volume of files and seemed to lose track of what I had allocated, and (c) I wouldn't be able to recover the data from the USB drives I used if the device eventually packed up. In fact, I got rid of the NSLU2 when it did lose track of my data. I was lucky to find a read-only Windows driver capable of reading the NSLU2 format, and I got my data back.

Getting back to the question in hand, how would I back up 200GB? I guess I could fork out a few grand for an LTO drive and tapes, but that's not cost effective. I could do disk-to-disk copy (which I do), but D2D isn't as portable as tape and is much more expensive if I intend to maintain multiple copies. I should mention that I've automatically discounted DVD and HD-DVD/Blu-Ray due to lack of capacity and cost (the same applies to the latest optical drives too).

I could use one of the many network backup services on offer. About 10 years ago, I looked at the feasibility of setting up one of these services for the storage company I worked for at the time. It was almost feasible; Freeserve was doing "free" dial-up internet (you paid for just the cost of the calls) and companies such as Energis were selling virtual dial-up modems on very good terms. However the backup model failed as the cost to the customer just didn't stack up due to the length of time to copy files out to the backup service.

I think network backup services *could* be the best answer to safeguarding your PC/workstation data. However the existing services have issues for me; basically I don't trust someone else with my data, which could include bank details, confidential letters and files. Even if I can encrypt my data as it is transmitted to the network backup service, they still have *all* of my data and with enough compute power could crack my encryption key.

If anyone has examples of services which could provide 100% security, I'd be interested to know.

So, here's my idea: a distributed backup service. We all have plenty of free space on those 500GB drives we've installed. Why not distribute your backups amongst other users in a peer-to-peer fashion? There are two main drawbacks to my mind. First, how can I guarantee my data will always be available for me to access (PCs may be powered off)? Second, how can I guarantee security?

Existing P2P services work by finding as many servers/PCs as possible which hold the data you want to download. Many may not be online; many may be online and running slow. By locating multiple copies of the required data, then hopefully one or more will be online and available for download.
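To put a rough number on "hopefully one or more will be online", here is a minimal sketch. It assumes each peer is online independently with the same probability, which real PCs won't quite satisfy (people tend to switch off at similar times), and the function names are my own, purely for illustration.

```python
def availability(p_online: float, copies: int) -> float:
    """Chance that at least one of `copies` peers holding the data is online.

    Assumes every peer is online independently with probability p_online.
    """
    if not 0.0 <= p_online <= 1.0:
        raise ValueError("p_online must be between 0 and 1")
    if copies < 0:
        raise ValueError("copies must not be negative")
    # All copies are offline with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_online) ** copies


def copies_needed(p_online: float, target: float) -> int:
    """Smallest number of copies whose availability reaches `target`."""
    if not 0.0 < p_online <= 1.0:
        raise ValueError("p_online must be greater than 0 and at most 1")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be at least 0 and less than 1")
    copies = 0
    while availability(p_online, copies) < target:
        copies += 1
    return copies


if __name__ == "__main__":
    # Hypothetical figures: peers online 30% of the time, 99% target.
    print(copies_needed(0.3, 0.99))
```

Under those assumptions, the number of copies grows quickly as peers become less reliable, which is the real cost of using other people's spare disk space.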

The same can be applied to backups: split files up, distribute the fragments to P2P servers and index where they are. The fragments would need to be encrypted in a way that guarantees anonymity but still allows identical fragments of files on your machine and on others' to be recognised as common. You then maintain an index to rebuild the backup data; all that then needs to be backed up is the index, which could easily fit onto something like a CD-ROM. All data could then be recovered using just a small CD index, and recreated from anywhere.
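The fragment-and-index idea can be sketched in a few lines. This is only an illustration under my own assumptions: fixed-size fragments, SHA-256 as the fragment identifier, and a plain dictionary standing in for the P2P network. Encryption is left out entirely here, and none of the names come from an existing product.

```python
import hashlib

FRAGMENT_SIZE = 64 * 1024  # assumed fragment size; any fixed size works


def split(data: bytes, size: int = FRAGMENT_SIZE) -> list[bytes]:
    """Cut the data into fixed-size fragments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data: bytes, store: dict[str, bytes],
           size: int = FRAGMENT_SIZE) -> list[str]:
    """Put fragments into `store` and return the index needed to rebuild.

    `store` stands in for the P2P network. Identical fragments hash to the
    same identifier, so they are only stored once.
    """
    index = []
    for fragment in split(data, size):
        fragment_id = hashlib.sha256(fragment).hexdigest()
        store.setdefault(fragment_id, fragment)
        index.append(fragment_id)
    return index


def restore(index: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from the index alone."""
    return b"".join(store[fragment_id] for fragment_id in index)


if __name__ == "__main__":
    network: dict[str, bytes] = {}
    original = b"abcdabcdxyz"
    index = backup(original, network, size=4)
    print(len(index), "index entries,", len(network), "stored fragments")
    print(restore(index, network) == original)
```

The index is just a list of short identifiers per file, which is why it stays small enough to burn to a CD even when the data itself runs to hundreds of gigabytes.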

There are a lot of unanswered issues. How would data be encrypted? How would the fragments be created and "de-duplicated"? How would fragments be distributed across the P2P members to ensure availability? How would the fragments be created to prevent the actual original files from being discovered?

Still, it's only a concept at this stage. But using the internet and all that unused disk space out there could prove a winner.

3 comments:

Ewantoo said...

There's already a company out there doing this called Vembu, who have a software client called StoreGrid. You install it on each PC on a network in your office, then each can back up data for the other PCs out there.

Unknown said...

There's another startup called Allmydata that is also pretty far down along this path...

john said...

And DIBS... http://web.mit.edu/~emin/www/source_code/dibs/

http://sourceforge.net/projects/dibs