Who is that doggy in the window?

Roy mentions Shazam, a music-identification-by-mobile-phone service which is -- apparently -- both expensive and addictive. (My phone is on Virgin Mobile, which doesn't peer with Shazam or whatever the technical term is, so I've never used it myself. But my friends alternate between raving about it and complaining about the amount they spend on it at 59p per go.)

How does Shazam work?

(Here I should point out that I know nothing about music, so I'm going to talk about an entire field of cultural endeavour reduced to a vaguely diverting signal-processing problem. Whatever.)

Shazam's internals are, no doubt, `commercially confidential'. But there is some information on possible algorithms on the web. Cheng Yang from the Stanford database group has published a number of papers on a fairly simple music matching algorithm; this 2001 paper and this one from 2002 give a reasonable summary.

The basic game is to compute spectrograms at points of peak intensity in the music; the argument here is that a piece of music is, roughly speaking, characterised by how the loudest bits sound. At each peak in each piece of music, the spectrogram is divided into 180 buckets with frequencies between 200 and 2000Hz, and an entire piece of music is described by a sequence of such 180-value spectrograms, stored with their time position in the song.

Matching is done by computing a similar sequence of spectrograms from the query (as recorded from your telephone or whatever) and searching for similar spectrograms in the database of known songs. The idea is to record t1 = offset-of-match-in-song for each spectrogram against t2 = offset-of-match-in-query, and then assess whether the query comes from the song by seeing whether it has the same peak spectrograms in the same sequence.

A perfect match -- in which the same peaks occur at the same times in the song and query -- would have t1 = a + t2 (since the query can come from any part of the song); a match between songs in different tempos would be t1 = a + b t2. The best matching a and b are found using linear regression in the first paper, and in the (more sophisticated) second paper, using the Hough transform beloved of computer vision researchers to pick up linear features. (The present implementation doesn't search for spectrograms which are similar after a pitch shift, with the result that it won't match music which has been transposed. Doing so would add another free parameter to the matches, obviously.)

The search step is computationally intensive. Searching for matching 180-element spectrograms is not easy. Yang uses an efficient procedure called `locality-sensitive hashing' to make this tractable. (There's a brief description in the second paper linked above, and a Google search for the term yields lots of information -- often from people using the technique for matching DNA sequences, in many ways a similar problem. The basic idea is something like this: pick a large number of random basis vectors v_i, and then for each data vector d record the result of d·v_i > 0 as a hash value h_i. Catenate the results of comparisons against all the v_i as H; if H for the query and H for a data vector match, compute the similarity between the two properly (and slowly) and add the data point to the matches if it is close enough. To find all (rather, most) of the similar data vectors, repeat the above procedure for lots of sets of v_i. There's a clear but much more mathematical description in these lecture notes, which also proves why this actually works, if intuition isn't enough.)

Unfortunately, even the optimised search procedure takes about ten times the duration of the input clip to search for matches; that is, a 30-second clip results in a search of about five minutes (see graph at end of 2002 paper). This could probably be improved using some DSP-tastic custom hardware but for a large-scale application like Shazam more optimisation would be needed to make the searches economic.

With a small corpus (220 piece of mainly classical music) Yang obtains better than 60% accuracy (defined as `correct result among top five matches') for thirty-second samples, with over 80% accuracy when the exact same piece is being searched for (the algorithm works for variants and derivative works, too, but less well; the idea is that, given one piece of music, it could find, for instance, the same piece recorded by different performers). Obviously it's the second case which is more important for a Shazam-type application, since Shazam push you towards buying CDs of the music you hear.

I don't know if this is exactly how Shazam actually works but I'd guess it isn't far off. Obviously something must have been done to improve the execution time. I suspect that if we only really care about exact matches then a less sophisticated search step could be used; in particular, we wouldn't need to match pieces in different tempo and it may be possible to just represent each song as the catenated peak spectrograms and search those.

If all else fails, well, computers are fast and memory is cheap. Throwing money at the problem works for that other information-retrieval success story, Google.

(Two other thoughts occur to me. Yang's algorithm -- at least in a toy version -- wouldn't be that hard to implement for someone with some spare time and a passing familiarity with the fast Fourier transform, and the problem is quite interesting. Yang bemoans the lack of a widely distributed and standard corpus of music to test against. But -- and at this point I am compelled to point out that using it would probably be an infringement of copyright -- the material available through file-sharing networks like Gnutella could provide an extensive and already machine-readable corpus with some descriptive text for each track. Samples from pop music radio stations would be a useful source of test queries, and some stations will provide details of their playlists, which would make automated testing tractable.... Anyway, if you do wander off and implement this to compete with Shazam, I'd be interested to hear how you get on -- especially if you're raking in 59p/query and want to give me some of it....)

2 September, 2003: Who is that doggy in the window?