|Context||Columbia EESIP Seminars|
In the past, music discovery revolved around the radio, magazines, and music stores. Once music started being shared digitally, the amount of music available exploded. Most people don't discover music through online stores - instead, discovery is human-powered, via sharing with friends on social networks. However, human-powered recommendation doesn't scale to the huge number of songs available. A common research direction, then, is “Google for Music” - where you can retrieve the kind of music you're looking for based on a query. Google actually did this, but it was a text search over metadata rather than the audio itself. Instead, it would be useful to be able to do a “more like this” search, where the query is a piece of audio and the recommendation is entirely content-based. Pandora follows this paradigm - you tell it the artists/songs you like, and it retrieves similar songs. Two paradigms for listening to music are explored here: archive browsing, where you have a lot of music and you want a ranked list of songs, and passive listening, where a playlist is generated automatically with much more limited feedback.
When defining the similarity between two pieces of music, people often just look at text tags - not looking at the audio at all. It can be hard to choose a tag vocabulary, because human tagging may be noisy and it can be hard to decide how “big” the vocabulary should be. This makes it hard to evaluate similarity in this paradigm. If, instead, you just ask humans directly what is similar, you run into ambiguity and subjectivity problems, and it's also extremely expensive to gather this data. A more scalable approach is to look at the listening histories of a huge group of users, and define song similarity based on how much the songs' listener populations overlap. This notion of similarity has worked well in the context of tagging, playlisting, and recommendation. Here, the feedback is implicit - it requires no additional effort from the user - it's just based on their listening patterns. However, it suffers from the “cold start problem”: the system doesn't know what to do with new songs that nobody has listened to yet. This becomes important when a service needs to be able to recommend novel and new music.
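The overlap-based notion of similarity can be sketched as follows. This is a minimal illustration (not the talk's actual formulation): the data and the choice of Jaccard overlap as the similarity measure are both hypothetical, standing in for real listening histories.

```python
from itertools import combinations

# Hypothetical listening data: item -> set of ids of users who played it.
listens = {
    "song_a": {1, 2, 3, 4},
    "song_b": {2, 3, 4, 5},
    "song_c": {7, 8},
}

def jaccard(users_x, users_y):
    """Overlap of two listener sets: |X & Y| / |X | Y|."""
    union = users_x | users_y
    return len(users_x & users_y) / len(union) if union else 0.0

# Pairwise similarity from listener overlap alone - no audio involved.
sim = {
    (a, b): jaccard(listens[a], listens[b])
    for a, b in combinations(sorted(listens), 2)
}
```

A brand-new song has an empty listener set, so its similarity to everything is 0 - which is exactly the cold-start problem described above.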
Instead of just looking at listening patterns, it would be useful to be able to predict the collaborative filtering output using only the audio content. This involves extracting features from the audio and learning a similarity measure that matches what collaborative filtering would predict. One approach is to use metric learning to warp the audio feature space so that similar songs end up “close together”. A ranking can then be produced by sorting songs by their learned distance to the query. More formally, this is metric learning to rank: the score of the target ranking should exceed the score of every other ranking by a margin that scales with how wrong that ranking is. There are a factorial number of possible rankings, so enforcing all of these constraints is infeasible; instead, an approximation is used where only the most-violated (“worst”) ranking is optimized at each step.
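The core idea - ranking candidates by distance to a query under a learned metric - can be sketched like this. The features, dimensions, and the matrix W are all made up here; in the real system W would be the output of the MLR optimization, not a random PSD matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for audio feature vectors: one query song, ten candidates.
query = rng.normal(size=5)
candidates = rng.normal(size=(10, 5))

# A learned metric is a positive semidefinite matrix W that warps the space.
# Here a random PSD matrix stands in for the learned one (W = A^T A is PSD).
A = rng.normal(size=(5, 5))
W = A.T @ A

def mahalanobis_sq(q, x, W):
    """Squared distance (q - x)^T W (q - x) under the metric W."""
    d = q - x
    return float(d @ W @ d)

# The induced ranking: candidate indices sorted closest-first to the query.
ranking = sorted(range(len(candidates)),
                 key=lambda i: mahalanobis_sq(query, candidates[i], W))
```

Training then amounts to choosing W so that this induced ranking agrees with the collaborative-filtering ground truth, enforcing the margin constraints only for the most-violated ranking at each iteration.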
In order to do the learning, the audio waveform needs to be transformed into a d-dimensional space. Here, the audio signal is framed, and MFCCs plus their deltas and delta-deltas are computed for each frame. These feature vectors are then vector quantized to give a histogram of codewords per song. The probability product kernel (PPK) is then computed over these histograms; its square root boosts the relative weight of rarely occurring codewords. The metric learning is then done on the PPK representations, using collaborative filtering information as ground truth. Here, listening histories for 360,000 users and 186,000 artists were used, a small subset of which had audio data. The performance of this system was measured via cross-validation against common approaches using GMM-based similarity and auto-tagging. It was found that metric learning helped in all cases, and learning on vector-quantized audio features did almost as well as expert tagging. Doing MLR on the expert tags did the best.
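The PPK step above can be sketched on toy histograms. The counts here are invented; real histograms would come from quantizing the MFCC+delta frames against a learned codebook. With rho = 1/2 this kernel is the Bhattacharyya coefficient between the normalized histograms.

```python
import numpy as np

# Hypothetical codeword histograms for two songs (counts per VQ codeword).
h1 = np.array([50.0, 3.0, 1.0, 0.0, 6.0])
h2 = np.array([40.0, 0.0, 2.0, 1.0, 7.0])

def ppk(h_x, h_y):
    """Probability product kernel with rho = 1/2: sum_i sqrt(p_i * q_i).

    The square root compresses large counts, so rarely used codewords
    contribute relatively more to the kernel value than raw counts would.
    """
    p = h_x / h_x.sum()
    q = h_y / h_y.sum()
    return float(np.sum(np.sqrt(p * q)))

k = ppk(h1, h2)  # in [0, 1]; equals 1 only for identical distributions
```

Metric learning then operates on these kernel values rather than on the raw histograms.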