Monday, 17 December 2007

Taking out the trash

In a recent post, Hu Yoshida references an IDC presentation discussing the rate of growth of structured versus unstructured data. It seems that we can expect unstructured data to grow at a rate of some 63.7% annually. I wonder what actual percentage of this data represents useful information?

Personally I know I'm guilty of data untidiness. I have a business file server on which I heap more data on a regular basis. Some of it is easy to structure; Excel and Word documents usually get named with something meaningful. Other stuff is less tangible. I download and evaluate a lot of software and end up with dozens (if not hundreds) of executables, msi and zip files, most of which are cryptically named by their providers.

Now the (personal) answer is to be more organised. Every time I download something, I could store it in a new structured folder. However, life isn't that simple. I'm on the move a lot and may download something at an Internet cafe or elsewhere where I'm offline from my main server. Whilst I use offline folders and synch a lot of data, I don't want to synch my entire server filesystem. The alternative is to create a local image of my server folders and copy data over on a regular basis. Trouble is, that's just too tedious, and when I have oodles of storage space, why should I bother wasting my time? There will of course come a time when I have to act. I will need to upgrade to bigger or more drives and I will have (more) issues with backup.

How much of the unstructured data growth out there happens for the same reasons? I think most of it. I can't believe we are really creating genuinely useful content at a rate of 63.7% per year. I think we're creating a lot of garbage that people are too scared to delete and can't filter adequately using existing tools.

OK, there are things out there to smooth over the cracks and partially address the issues. We "archive", "dedupe", "tier" but essentially we don't *delete*. I think if many more organisations operated a strict Delete Policy on certain types of data after a fixed non-access time, then we would all go a long way to cutting the 63.7% down to a more manageable figure.
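To make the idea of a non-access Delete Policy concrete, here is a minimal Python sketch that only lists candidates and deletes nothing. The function name `find_stale_files`, the day threshold and the extension filter are my own illustrative choices, not any product's feature. It also assumes the filesystem records last-access times reliably, which is not a given (many systems mount with access-time updates reduced or disabled).

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_idle_days, extensions=None, now=None):
    """List files under root whose last access is older than max_idle_days.

    extensions is an optional set such as {".zip", ".msi"}; when given,
    only files with those suffixes are considered. Nothing is deleted here.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if extensions and path.suffix.lower() not in extensions:
                continue
            try:
                last_access = path.stat().st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if last_access < cutoff:
                stale.append(path)
    return sorted(stale)
```

Something like `find_stale_files("downloads", 365, {".zip", ".msi", ".exe"})` would give a review list for the weekly tidy-up; whether anything on it is actually safe to delete remains a human decision.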

Note to self: spend 1 hour a week tidying up my file systems and taking out the trash.....


Carl said...

I absolutely agree. Somebody smart said that the Greenest bit is the one you Delete.

We (including me!) store so much cruft because we're too lazy to separate the wheat from the chaff. My issue is that when I do need something from that "archive", I'm more likely to either a) download it again, or b) ask somebody if they can find it.

Anil Gupta said...


How do you determine when certain data can be deleted? For every organization and person, the deletion criteria may be different. Deleting data due to extended non-access may not be a prudent approach in most cases. Consider financial records, tax returns, investment purchase/sell records, and medical records for example.

For a decade now, I have had a policy of not deleting anything. Now I have reached the point where I am trying to build a multi-terabyte 2.5" SAS drive based storage server for my home. ;-)

Going forward, some duplicate elimination technique will be a must for most types of data storage.
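As a rough illustration of file-level duplicate elimination, the Python sketch below groups files with identical content by comparing sizes first and SHA-256 digests second. This is a simple whole-file approach under my own assumptions, not how any particular storage product does it (those typically work at block or chunk level); the names `file_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (lists of paths) of files with identical content."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                by_size[path.stat().st_size].append(path)
            except OSError:
                continue  # unreadable entry; skip it

    # Second pass: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Each returned group is a set of identical copies; keeping one and removing (or linking) the rest is left to whoever owns the data.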


Chris M Evans said...

Carl, yes, I'm sure I've re-downloaded stuff multiple times - usually drivers which I meant to file properly and didn't.

Chris M Evans said...


I agree everyone's criteria will be different; so, you could (for example) have a delete policy on non-personal data which remains unreferenced, say JPG and MP3 and WMV files, after a certain length of time. After that, anything which *may* reference someone would need other rules; it may be possible in some circumstances to delete all files for someone leaving a company (in others it may not).

I think we need to start looking at what types of data we are storing instead of calling it unstructured. Those files within themselves are structured and can be validated. That has to be the next stage in beating the growth of garbage.
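One hedged way to start "looking at what types of data we are storing" is to classify files by their leading bytes rather than by name. The Python sketch below checks a handful of well-known signatures; the table is deliberately tiny, the labels and the function name `sniff_type` are my own, and a real tool would need a far larger signature list plus proper validation of the file body.

```python
# (signature bytes, label) pairs; checked in order against the file header.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
    (b"ID3", "mp3"),   # only MP3 files that begin with an ID3v2 tag
    (b"MZ", "exe"),    # DOS/Windows executables
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # MSI and older Office files
    (b"%PDF-", "pdf"),
]


def sniff_type(path):
    """Guess a file's type from its first bytes, ignoring its name."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"
```

A cryptically named download then at least reveals whether it is an installer, an archive or a picture, which is the minimum a per-type retention rule would need to know.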

Stephen said...


I have two backup programs running at home. One is for our personal stuff that resides in my document type directories and the other is some system thing. I was thinking that the average user might be happy with some sort of software that backs up their stuff but reads the file magic and then gives them the opportunity to say, delete me at such a time. Currently, I don't have much control over my data and I would like it to change.

As you say, downloads are a major source of rubbish on my system and when I occasionally have to do a backup, I find junk from years ago still available.

I went to an EMC talk once and they harped on something fierce about unstructured data, especially on iPods and mobile/cell phones. I said it was not my problem as I only look after enterprise storage, and so did the rest of the people at the presentation. That sort of shut them up for a few moments.

I bet I have at least four copies of my MP3s and that must use many tens of GBs. Then my new HiDef movies use about 20 MB per second. So they sit on my media server and my backups. So far I only have 1.5 TB at home...

I won't go on as I feel even more lost now.