Wednesday 3 January 2007

My Favourite De-Duplication Technology

Here's one of my favourite websites: www.shazam.com. In fact it isn't the website itself that's the favourite thing, it's what Shazam do. In the UK (apologies to US readers, I don't know your number), dialling 2580 (straight down the middle of your phone's keypad) and holding your mobile up to a music source for 30 seconds gets you a text back with the track title and the artist. Seeing this for the first time is amazing; as long as the track is reasonably clear, any 30-second clip will usually work. I've astounded (and bored) dozens of friends and it only costs me 50p each time.

So, Shazam got me thinking. How can they index the millions of music tracks in existence today and match them against a random clip of music I provide over a tinny link from my mobile phone? The most obvious issues are those of quality; I've almost always used the service in a bar with a lot of background noise (mostly drunken colleagues). That aside, I tried to work out how they could have indexed all the tracks while still letting me supply a random piece of the track to match against.

I started thinking about pattern matching and data de-duplication as it exists today. Most de-dupe technology seems to rely on identifying common patterns within data and indexing them against a generated "hash" code which (hopefully) uniquely references that piece of data. With data containing lots of similar content, a great deal of duplication can be removed from storage and replaced with pointers. Good examples would be backups of email data (where a user or a group of users share the same content) and database backups where only a small percentage of the database has changed. The clever de-dupe technology can identify variable-length patterns and determine variable start positions (i.e. byte-level granularity) when indexing content. This is extremely important where database compression re-aligns data on non-uniform boundaries.
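
To make that concrete, here's a rough Python sketch of content-defined chunking, one common way de-dupe products get that byte-level granularity. It's only an illustration of the technique, not any vendor's actual implementation, and the constants (window size, boundary mask, chunk limits) are arbitrary values I've picked for the example:

    import hashlib
    from collections import deque

    WINDOW = 48                 # bytes in the rolling-hash window (illustrative)
    BASE, MOD = 257, 1 << 32
    POW = pow(BASE, WINDOW - 1, MOD)   # factor to drop the oldest byte
    MASK = 0x1FFF               # boundary when low 13 bits are zero (~8 KB average)
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunk_boundaries(data):
        """Yield (start, end) offsets of content-defined chunks.

        A rolling hash over a small sliding window is checked at every
        byte; a boundary is declared wherever its low bits are all zero,
        so chunk edges depend on the bytes themselves, not their
        position. Insert a byte early in the stream and the later
        chunks still line up and de-dupe."""
        start, h = 0, 0
        window = deque()
        for i, b in enumerate(data):
            if len(window) == WINDOW:
                h = (h - window.popleft() * POW) % MOD   # drop oldest byte
            h = (h * BASE + b) % MOD                     # add newest byte
            window.append(b)
            length = i + 1 - start
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                yield start, i + 1
                start = i + 1
        if start < len(data):
            yield start, len(data)

    def dedupe(data, store):
        """Store each unique chunk once, keyed by its SHA-256 digest,
        and return the list of digests needed to rebuild the data."""
        recipe = []
        for s, e in chunk_boundaries(data):
            chunk = data[s:e]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)   # duplicates cost a pointer, not a copy
            recipe.append(digest)
        return recipe

Run two similar backups through dedupe() sharing the same store and the unchanged chunks are only kept once, however much the data has shifted about between them.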

Now this is where I failed to understand how Shazam could work: OK, so the source content could be de-duped and indexed, but how could they determine where my sample occurred in the music track? A simple internet search located a presentation from Avery Wang, the guy who developed the technology. The detail is here: http://ismir2003.ismir.net/presentations/Wang.PDF. The process actually generates a fingerprint for each track, highlighting specific unique spectrogram peaks in the music, then pairs a number of these to generate hash tokens via a technique called "combinatorial hashing". This uniquely identifies a track, but also provides the answer as to how any clip can be used to identify it: the relative offsets between hash tokens are used to match the track, so the absolute offset of the sample isn't important.
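
For the curious, here's a rough Python sketch of that matching idea as I understand it from the paper. The hard signal-processing step that turns audio into (time, frequency) spectrogram peaks is skipped entirely, and names like FAN_OUT are my own illustrative choices, not Shazam's real parameters:

    from collections import defaultdict

    FAN_OUT = 5   # pair each anchor peak with the next few peaks (assumed value)

    def hash_peaks(peaks):
        """Turn a list of spectrogram peaks [(time, freq), ...] into
        combinatorial hash tokens. Each anchor peak is paired with a
        few later peaks; the token encodes both frequencies and the
        time gap between them, so it survives shifting the clip."""
        peaks = sorted(peaks)
        for i, (t1, f1) in enumerate(peaks):
            for t2, f2 in peaks[i + 1:i + 1 + FAN_OUT]:
                yield (f1, f2, t2 - t1), t1   # token plus anchor time

    def build_index(tracks):
        """tracks: {track_id: peak list}. Index token -> (track, time)."""
        index = defaultdict(list)
        for track_id, peaks in tracks.items():
            for token, t in hash_peaks(peaks):
                index[token].append((track_id, t))
        return index

    def identify(clip_peaks, index):
        """Match a clip: for every token hit, the gap between the
        token's time in the track and its time in the clip is constant
        for a true match, whatever part of the song the clip is from."""
        votes = defaultdict(int)
        for token, t_clip in hash_peaks(clip_peaks):
            for track_id, t_track in index.get(token, []):
                votes[(track_id, t_track - t_clip)] += 1
        if not votes:
            return None
        (track_id, _), count = max(votes.items(), key=lambda kv: kv[1])
        return track_id, count

The key line is t_track - t_clip: a genuine match piles up votes at one constant offset, which is exactly why any 30-second slice of the song will do.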

Anyway, enough of the techie talk; try Shazam and amaze your friends!
