A Similarity Scale for Content-Based Music IR
Donald Byrd
School of Informatics and School of Music, Indiana University
February 2003; last rev. early June 2007
Music Information Retrieval (music IR)--more technically, content-based music IR--addresses a huge range of tasks. This scope is rarely acknowledged, which is unfortunate because some tasks have very different demands from others, and some are vastly more difficult than others. The main factor, in my view, is just how similar the relevant documents in the collection to be searched really are to the "query". This paper is an attempt to clarify matters.
In the table below:
Not surprisingly, this ordering also ranks music information retrieval tasks from easiest to most difficult; in fact, a strong case can be made that category (1) -- while by no means easy -- is not even a music-IR problem. A related table and graph, along with an interesting discussion of the range of music-IR tasks, appear in Typke et al. (2005). Also see Casey and Slaney (2006), which includes a graph and discussion of the range of music-IR tasks from a perspective that is closer to mine.
Music involves much more complex structural relationships than most text or other media; two clear reasons are that music is a performing art and that the vast majority of music is, in a broad sense, polyphonic. See Vellucci (1997) for an extensive discussion of the issues from the perspective of library science. Librarians and text-IR researchers often speak of known-item searches, where the user is trying to locate a copy of a document they already know about, as an important special case. The complexity of musical relationships makes it difficult even to say which of these tasks are known-item searches. Category (1) certainly is; the last few categories clearly are not. Other than that, I leave the question to the reader. A related and important question is what the phrase "same music" means. This is not easy to answer, but the best answer may be to appeal to the concept of a musical work, something the emerging library standard FRBR depends on. For a brief discussion, see Tillett (2004).
The categories whose descriptions start in boldface are what mainstream content-based IR systems focus on.
Detailed audio characteristics in common

| Relationship category (task) | Basic representation | Example systems | Comment |
|---|---|---|---|
| 1. Same music, arrangement, performance venue, session, performance, & recording | Audio | Shazam ("IPR" version), Audible Magic, MusicDNS(?) | Via audio fingerprint. Current systems are both very accurate and very fast, even with large collections. (See note below.) |
| 2a, b. Same music, arrangement, performance venue, session, performance; different recording. a: Play back original recording & re-record. b: Different original recording. | Audio | (2a) Shazam (public version) | Via audio fingerprint. (2a) Same comment as for Category 1, though with somewhat less accuracy. (See note below.) |
| 3. Same music, arrangement, performance venue, session; different performance, recording | Audio | none(?) | E.g., retakes. |
| 4. Same music, arrangement, performance venue; different session, performance, recording | Audio | none(?) | |

No detailed audio characteristics in common

| Relationship category (task) | Basic representation | Example systems | Comment |
|---|---|---|---|
| 5. Same music, arrangement; different performance venue, session, etc. | Audio, events | Foote: ARTHUR | There's an analogous situation in notation, with nothing different; this is the notation equivalent of Category (1). |
| 6. Same music, different arrangement; or different but closely-related music, e.g., conservative variations (Mozart, etc.), alternate takes, most covers and remixes, minor revisions | any | C-Brahms, Greenstone/Meldex, Musipedia, Pickens et al/OMRAS, Themefinder, etc. | Current monophonic systems are good, especially with events or notation; polyphonic systems are fair to good. (See note below.) |
| 7. Different & less closely-related music: freer variations (Brahms, much jazz, etc.), wilder covers, extensive revisions | any | none(?) | A serious AI problem. Current systems are poor. (See note below.) |
| 8. Music in same (form or style) genre, etc. | any | Cuidado, SOMeJB, Tzanetakis(?) | Agreement even among human experts is limited. |
| 9. Music influenced by other music | any | none(?) | Agreement even among human experts is limited. |
| 10. No detectable relationship | any | (none possible) | |
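For readers who want to use the scale in software (for instance, to label the relevance judgments in a test collection), the ordering can be written down as an ordered enumeration. The sketch below is only an illustration: the names `Similarity`, `shares_audio_detail`, and `harder_task`, and the short member names, are my own hypothetical choices and are not part of the table. Category 2 is kept as a single level, ignoring the 2a/2b distinction.

```python
from enum import IntEnum


class Similarity(IntEnum):
    """The ten relationship categories, numbered as in the table.

    Member names are hypothetical abbreviations of the table's descriptions.
    """
    SAME_RECORDING = 1        # everything the same, including the recording
    SAME_PERFORMANCE = 2      # same performance, different recording (2a, 2b)
    SAME_SESSION = 3          # same session, different performance (e.g., retakes)
    SAME_VENUE = 4            # same venue, different session
    SAME_ARRANGEMENT = 5      # same arrangement, different venue, session, etc.
    CLOSELY_RELATED = 6       # different arrangement or closely-related music
    LESS_CLOSELY_RELATED = 7  # freer variations, wilder covers
    SAME_GENRE = 8            # same (form or style) genre
    INFLUENCED = 9            # music influenced by other music
    NO_RELATIONSHIP = 10      # no detectable relationship


def shares_audio_detail(category: Similarity) -> bool:
    """True for the first part of the table (categories 1 through 4)."""
    return category <= Similarity.SAME_VENUE


def harder_task(a: Similarity, b: Similarity) -> Similarity:
    """Return whichever category sits later in the ordering, i.e. the harder task."""
    return max(a, b)
```

Because `IntEnum` members compare as integers, the ordering of the table carries over directly: `Similarity.SAME_RECORDING < Similarity.CLOSELY_RELATED` is true, and `harder_task` simply returns the larger of the two.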
Notes:
Category 1: The two recordings being compared here are in fact identical. The problem the "IPR" version of Shazam (intended for use by record companies and other music rights owners), Audible Magic, and other audio-fingerprinting systems attempt to solve is to recognize that accurately and efficiently, even for a huge collection of music. Clearly this is an audio-only situation.
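To give a feel for why identification can be both accurate and fast when the recordings really are identical, here is a deliberately naive sketch. It is not how Shazam, Audible Magic, or any real fingerprinting system works: it assumes recordings are available as sequences of integer samples, that an excerpt matches the stored recording sample for sample, and that raw windows of samples can serve as lookup keys. The names `WINDOW`, `build_index`, and `identify` are hypothetical.

```python
from collections import Counter, defaultdict

WINDOW = 4  # samples per lookup key; an arbitrary toy value


def build_index(tracks):
    """Map every run of WINDOW consecutive samples to the (track, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, samples in tracks.items():
        for offset in range(len(samples) - WINDOW + 1):
            key = tuple(samples[offset:offset + WINDOW])
            index[key].append((name, offset))
    return index


def identify(index, excerpt):
    """Return the track whose windows line up with the excerpt most consistently, or None."""
    votes = Counter()
    for q in range(len(excerpt) - WINDOW + 1):
        key = tuple(excerpt[q:q + WINDOW])
        for name, offset in index.get(key, []):
            # A true match yields the same (track, time shift) for every window.
            votes[(name, offset - q)] += 1
    if not votes:
        return None
    (name, _shift), _count = votes.most_common(1)[0]
    return name


# Toy collection: two short "recordings" as lists of integer samples.
tracks = {
    "a": [1, 5, 2, 8, 3, 9, 4, 7, 6, 0],
    "b": [9, 9, 1, 2, 3, 4, 5, 6, 7, 8],
}
index = build_index(tracks)
```

Each lookup is a dictionary access, so the cost of a query depends on the length of the excerpt rather than on the size of the collection. Exact matching on raw samples would fail as soon as any noise is introduced, which is the situation Category 2a addresses.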
Category 2: Two distinct situations are possible. Category 2a essentially means playing back the original recording and recording the playback; this is what the better-known version of Shazam available to individuals is designed for. It does the same thing as the "IPR" version, but in the presence of noise introduced by the environment and the transmission channel; for example, music played on a jukebox in a crowded bar and transmitted to a server via a mobile phone's low-fidelity microphone. Category 2b, on the other hand, involves comparing different original recordings of the same performance; this situation is much less well known, but it's by no means contrived. One example would be different mixes of the same studio "take". Also, there are probably rock concerts for which dozens or even hundreds of recordings exist, albeit nearly all low fidelity (and illegal, except for bands like the Grateful Dead that allow them!). Finally, there are surviving examples of early recordings made simultaneously on two or more masters fed by different microphones. (This leads -- accidentally -- to stereo recordings. At least one, a 1932 Duke Ellington session, has been released commercially in such a version.) There's an analogous situation with events, but practical applications are likely to be rare.
Categories 6 and 7: The boundary between these two important categories is difficult to draw. Johnny Cash's cover of the Nine Inch Nails song "Hurt" is clearly in Category 6, as are most of Mozart's Variations on Ah! Vous Dirai-je, Maman (a.k.a. "Twinkle, Twinkle, Little Star"); Jose Feliciano's version of The Doors' "Light My Fire" is definitely in Category 7. But what about instrumental versions that closely follow the melody, harmony, and form of a song (e.g., George Winston's solo piano version of "Light My Fire" or the Kronos Quartet cover of "Purple Haze"), or vice versa (the song "Stranger in Paradise", based on Borodin's Polovetsian Dances)? Finally, one of the two Guess Who versions of "Light My Fire" available via iTunes in early 2007 is extremely similar to the original, except that it reduces the very salient 5-minute instrumental interlude to just a few seconds. Is that a "conservative" cover or not?
Categories 8 and 9: Doing well on these tasks undoubtedly requires human-level intelligence.
Acknowledgements
Michael Casey's comments on an earlier version of this document clarified my thinking considerably and led to major improvements in it. Tim Crawford made his own version of my original table with some thought-provoking differences. Jeremy Pickens pointed out a number of things that needed clarification. In addition, Ed Wolf and other members of my spring 2007 Music Representation and Retrieval seminar made some very helpful comments. My thanks to all of them.
References