# Deep learning and feature learning for music information retrieval (talk)

| Presenter | Context | Date |
|---|---|---|
| Sander Dieleman | Columbia Neural Network Reading Group/Seminar Series | 7/23/14 |

This talk covers a variety of experiments/papers in the general theme of deep learning for MIR.

## Multiscale Music Audio Feature Learning

Multiscale approaches to music audio feature learning. Sander Dieleman, Benjamin Schrauwen, ISMIR 2013

Feature learning is receiving more attention from the music information retrieval (MIR) community, inspired by good results in speech recognition and computer vision. Music exhibits structure on many different timescales: periodic waveforms, repeating motifs, themes and melodies, and, at the highest level, musical form. Knowledge about this structure can be exploited while learning unsupervised features for MIR tasks.

One approach is spherical K-means, where the means are constrained to lie on the unit sphere (that is, to have unit L2 norm). It's a relatively simple algorithm with only one parameter to tune (the number of clusters to learn, $K$). It is much faster than RBMs, autoencoders, sparse coding, etc., and turns out to work almost as well. When learning the clusters, each point is assigned to only one cluster; when extracting features, a linear combination is used instead, with each input projected onto all of the learned means (a convolution operation).
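The hard-assignment learning step and the linear feature-extraction step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names, the initialization from random data points, the fixed iteration count, and the choice to normalize inputs to unit length are all assumptions made for the example.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=10, seed=0):
    """Cluster the rows of X with centroids constrained to unit L2 norm."""
    rng = np.random.default_rng(seed)
    # Project the data onto the unit sphere (assumption: no all-zero rows).
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    # Assumed initialization: k distinct data points chosen at random.
    D = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Hard assignment: each point goes to the centroid with the largest dot product.
        assign = np.argmax(X @ D.T, axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                mean = members.sum(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 1e-12:
                    # Renormalize so the centroid stays on the unit sphere.
                    D[j] = mean / norm
    return D

def extract_features(X, D):
    """Linear encoding: dot product of every input with every centroid."""
    return X @ D.T
```

Applying `extract_features` to every window of a spectrogram amounts to convolving the learned means with the spectrogram, which is why the extraction step is described as a convolution.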

One way to create a multiscale representation of audio is to compute spectrograms with different window sizes. When the window size is larger, each frame sees more of the signal, so time resolution suffers while frequency resolution improves. Another way to get a multiscale representation is to compute the spectrogram at some fixed window size, then smooth and downsample repeatedly (a Gaussian pyramid). Finally, the Laplacian pyramid simply takes the difference between levels of the Gaussian pyramid. Once the mid-level representation is obtained, windows of the spectrogram are extracted and PCA whitened, and K-means is used to extract features. The K-means features for each window are pooled together to get “summary” features.
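The two pyramid constructions can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the 5-tap binomial smoothing kernel, the factor-of-two downsampling along time only, and the nearest-neighbor upsampling used to align levels before subtracting are all choices made for the example.

```python
import numpy as np

def smooth_time(S):
    """Smooth a (n_freq, n_frames) spectrogram along time with a binomial kernel."""
    kernel = np.array([1, 4, 6, 4, 1]) / 16.0  # assumed approximation of a Gaussian
    # Repeat the edge frames so the output keeps the same number of frames.
    padded = np.pad(S, ((0, 0), (2, 2)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

def gaussian_pyramid(S, n_levels):
    """Repeatedly smooth and downsample by 2 along the time axis."""
    levels = [S]
    for _ in range(n_levels - 1):
        levels.append(smooth_time(levels[-1])[:, ::2])
    return levels

def laplacian_pyramid(S, n_levels):
    """Differences between consecutive Gaussian levels, plus the coarsest level."""
    gauss = gaussian_pyramid(S, n_levels)
    lap = []
    for fine, coarse in zip(gauss[:-1], gauss[1:]):
        # Upsample the coarser level by frame repetition, then crop to match.
        up = np.repeat(coarse, 2, axis=1)[:, :fine.shape[1]]
        lap.append(fine - up)
    lap.append(gauss[-1])
    return lap
```

Each Laplacian level keeps only the detail that is lost when moving to the next coarser Gaussian level, so the levels carry less redundant information than the Gaussian pyramid.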

To test this approach, an autotagging experiment on the Magnatagatune dataset was performed. This dataset includes features and tags for a large semantic space, with a large variety of different types of music. Only the fifty most common tags were used. A multilayer perceptron was used for classification. Features used included the raw multiresolution spectrogram, Gaussian pyramid features over 1, 2, and 4 frames, and Laplacian pyramid features over 1, 2, and 4 frames. Feature learning was performed either with the raw whitened PCA or K-means with a variety of settings of $K$. The Laplacian pyramid over 1 frame worked best. A separate experiment compared using all of the timescales against using only one of them; different tags performed better with different timescales, but using all timescale levels always worked best.

## Deep Content-Based Music Recommendation

Deep content-based music recommendation. Aaron van den Oord, Sander Dieleman, Benjamin Schrauwen, NIPS 2013

Music is largely distributed digitally today, and music has a particularly long tail (a few items are very popular and most are unpopular, with lots of niche items and subgenres). One way to do music recommendation is collaborative filtering, which uses listening patterns from users to determine which songs are similar. It tends to perform quite well if the listening data is available. The downside is that you can only use collaborative filtering when you have listening information, which you won't have for new music or niche items (the cold start problem). A different approach is to use the metadata and audio content (content-based recommendation), which currently doesn't work as well but doesn't suffer from the cold start problem. A difficulty with content-based recommendation is that there's a big “gap” between the audio signal and higher-level characteristics like mood and genre.

The most popular way to do collaborative filtering is to use matrix factorization, where you construct a matrix $R$ with users as rows and songs as columns, where $R_{ij}$ is nonzero if user $i$ listened to song $j$. The factorization then represents latent factors for users and songs. For some problems, a better approach is to use a weighted factorization, because if a user has listened to a song it's a strong positive signal, but if a user hasn't listened to a song it's only a weak negative signal. A weighted matrix factorization therefore uses a confidence matrix which weights the strong positive signal higher.
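To make the role of the confidence matrix concrete, here is a minimal sketch of a weighted matrix factorization objective. The specific confidence function (`1 + alpha * R`), the binary preference matrix, the L2 regularizer, and the names `wmf_loss`, `alpha`, and `lam` are assumptions chosen for illustration; the talk notes do not specify the exact form used.

```python
import numpy as np

def wmf_loss(R, U, V, alpha=2.0, lam=0.1):
    """Weighted squared error between preferences and the product of the factors.

    R: (n_users, n_songs) play counts; U: (n_users, d); V: (n_songs, d).
    """
    P = (R > 0).astype(float)  # preference: 1 if the user listened, else 0
    C = 1.0 + alpha * R        # assumed confidence: higher for listened songs
    err = P - U @ V.T
    # Unlistened entries still contribute, but with the minimum confidence of 1.
    return np.sum(C * err ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))
```

With all-zero factors, the loss reduces to the summed confidence of the listened entries, which shows how errors on observed plays dominate errors on unobserved ones.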

The goal of this work is to predict the latent factors directly from the audio signal using a regression model. The model used was a convolutional neural network. Audio data was represented by 3s spectrograms. The network's first layer had filters six frames wide which were slid along the spectrogram, followed by max-pooling over four frames. The same thing is then repeated on top of that until a representation with only five time steps is obtained, which is flattened by summing and passed to a dense neural net layer which predicts the latent factors.
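One convolution-plus-pooling stage of the kind described above might look like the following NumPy sketch. The filter width of six frames and the pooling size of four come from the notes; the assumption that each filter spans all frequency bins, the ReLU nonlinearity, and the function names are illustrative choices rather than details confirmed by the source.

```python
import numpy as np

def conv_time(S, W):
    """Slide filters along the time axis of a spectrogram (no padding).

    S: (n_freq, n_frames); W: (n_filters, n_freq, width).
    Returns an array of shape (n_filters, n_frames - width + 1).
    """
    n_filters, n_freq, width = W.shape
    n_out = S.shape[1] - width + 1
    out = np.empty((n_filters, n_out))
    for t in range(n_out):
        # Each filter covers all frequency bins and `width` consecutive frames.
        out[:, t] = np.tensordot(W, S[:, t:t + width], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU (assumed nonlinearity)

def max_pool_time(H, size):
    """Non-overlapping max-pooling along time; trailing frames are dropped."""
    n_out = H.shape[1] // size
    return H[:, :n_out * size].reshape(H.shape[0], n_out, size).max(axis=2)
```

For example, a 29-frame input convolved with width-6 filters gives 24 time steps, and pooling over four frames reduces that to 6. Stacking such stages shrinks the time axis quickly.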

An experiment was run using the Million Song Dataset (MSD), which has listening data for 1.1 million users for 380k of the songs. Since the MSD doesn't include audio data, short clips were downloaded from 7digital, which covered about 99% of the 380k songs. On a smaller subset, the models trained included metric learning to rank, linear regression, an MLP, and convolutional neural nets with mean squared error and weighted prediction error objectives. The convolutional neural nets performed best, with the mean squared error objective working better than weighted prediction error. On the full dataset, the convolutional neural net performed even better relative to a linear regression model, although there's still a gap between the CNN performance and the latent factor model upper bound. The latent factor model tends to predict songs by the same artist as similar, whereas the CNN model is more likely to predict songs which are sonically similar. Separately, t-SNE was used to visualize the data, which did a good job of clustering roughly by genre.

## End-to-end Learning for Music Audio

End-to-end learning for music audio, Sander Dieleman, Benjamin Schrauwen, ICASSP 2014

The traditional approach to classifying audio (and images and other types of data in general) is to first extract features which we think encode relevant information for the task we're trying to solve, then train a classifier on top of those features. In audio, MFCC or chroma features are often used. In computer vision, people have been getting rid of the feature extraction stage. Inspired by that, people in MIR have been doing similar things, but the features are typically learned on top of a mid-level time-frequency representation like a spectrogram. This is because it makes the data look like images (so image processing tools can be used directly) and because the mid-level representation seems to better represent the important aspects of the audio. In this experiment, learning is attempted using the raw audio data instead.

Convolutional neural nets can learn the features and the classifier simultaneously, so they are used. For comparison, a mid-level representation (a mel-scaled log-magnitude spectrogram) is used. Again, the experiment is on the Magnatagatune dataset. Three representations were compared: the mid-level spectrogram, raw audio with strided convolution, and raw audio with strided convolution and feature pooling. A strided convolution is one where you slide your filter along the input but skip ahead by more than one sample at each step (like the hop in an STFT). The same convolutional net was built on top of these three representations. Spectrograms still perform better. The learned filters are frequency-selective (like a mel-scaled basis) and noisy, and the distribution of their center frequencies looks mel-scaled. To try to improve performance on raw audio, log non-linearities were added, since they are used in the mid-level representation, but this made the optimization too hard and did not help. Adding pooling over the filtered audio signal, which was expected to introduce phase invariance, doesn't improve performance either; it does make the filters less noisy, and the pools end up consisting of filters which are shifted versions of each other.
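The strided convolution described above can be sketched directly on a raw waveform. This is a minimal NumPy illustration with hypothetical names and shapes, not the network from the paper. Like most neural network libraries, it computes a cross-correlation (no filter flipping).

```python
import numpy as np

def strided_conv1d(x, filters, stride):
    """Apply a bank of 1-D filters to a signal, hopping `stride` samples at a time.

    x: (n_samples,); filters: (n_filters, filter_len).
    Returns an array of shape (n_filters, n_frames).
    """
    filter_len = filters.shape[1]
    n_frames = 1 + (len(x) - filter_len) // stride
    # Index matrix: row t selects samples t*stride ... t*stride + filter_len - 1.
    idx = stride * np.arange(n_frames)[:, None] + np.arange(filter_len)[None, :]
    frames = x[idx]            # (n_frames, filter_len), like STFT framing
    return filters @ frames.T  # one dot product per filter per frame
```

With `filter_len` playing the role of the window size and `stride` the hop size, the output has the same layout as a spectrogram, except that the basis is learned rather than fixed to sinusoids.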

## Transfer Learning by Supervised Pre-Training

Transfer Learning by supervised pre-training for audio-based music classification, Aaron van den Oord, Sander Dieleman, Benjamin Schrauwen, ISMIR 2014

“AlexNet” is a deep convolutional neural network which won the ImageNet challenge (classifying images by their content) by a landslide. It turns out that the representation in the second-to-last layer (before classification is performed) provides a good feature representation for other problems; people have tried just learning a simple model on top of these features. In this work, features are learned on the Million Song Dataset in a supervised manner, and these features are used for other tasks. The MSD has lots of data for automatic tagging, user listening preference prediction, etc. Using this data, features are learned and are then used on other datasets for genre and tag prediction. The tag, listening preference, and genre tasks all differ in their characteristics (multi-label vs. single-label, number of tags vs. number of users, some tags are missing, some tags are redundant, and some tags only apply to a small subset of songs).

To perform transfer learning, tags are converted to latent factors using weighted matrix factorization (WMF). The audio signals for the source task have their spectrograms computed, and K-means is used (for speed reasons) to extract low-level features. From this, an MLP classifier is trained. In the target tasks, spectrograms and K-means features are also computed, and the final features are taken from the representations of the source task's classifier. In genre classification, a linear regression source model seems to perform well; the MLP features from the listening data task perform worse than those from the tag data task, probably because the model overfits on the listening data task. For the tagging task, the tag source task improves results a great deal.