Notes from the 2010 MIR workshop at CCRMA (titled “Intelligent Audio Systems: Foundations and Applications of Music Information Retrieval”), which took place from July 12-16. The main wiki page for it is here with more information. These notes are in chronological order, and could probably use a LITTLE clean-up.
Stands for “Music Information Retrieval”. Extract information (features, semantics, quality, similarity) from audio.
Pitch/rhythm tracking: Guitar Hero, BMAT Score, JamLegend. DAWs with beat/pitch tracking: Ableton, Melodyne, Mixed in Key. Music creation software: Khush, UJAM, SongSmith, VoiceBand. Audio search/Query By Humming: SoundHound (previously MiDoMi). Fingerprinting: SoundHound, Shazam. Music recommendation: Gracenote, EchoNest, BMAT, Bach Technology, MoodAgent. Assisted music transcription: Transcribe!, TwelveKeys Music Transcription Assistant, Zenph. Restoration: rather than restoring recordings, detect notes/pitch and transcribe.
Don Byrd and Tim Crawford jointly received funding for music recognition/search as well as a MIR workshop in 1999; Downie, Huron and Nevill hosted a MIR workshop at ACM DL/SIGIR '99; Crawford and Boehm organized a MIR workshop in '99. In 2000, the first ISMIR was hosted; MIREX benchmarking began in 2005; in 2008 ISMIR was incorporated as a society. Other conferences: ICMC, ACM MM, DAFx, ICASSP, NIME, AES, ACM IR, ICML, ICMPC, JCDL; journals: Computer Music Journal, Journal of New Music Research, IEEE SP, IEEE Acoustics, Speech, SP.
First, take an audio file and perform some segmentation on it: chop it up into frames, look at onsets, beats, bars, chord changes, etc, depending on the problem. Frames often call for some overlap. Onsets are noticeable changes corresponding to the beginning of a sound event; in order to do good onset detection we need to examine different frequency bands. Beat detection is good for rhythmic music; it requires onset detection and periodicity measurements. Second, perform feature extraction - get a variety of measurements on pieces of audio. One basic feature is zero-crossing rate, the number of times the signal crosses zero in the frame. This is enough to classify snare vs. bass drum, for example, using ZCR as an indication of noisiness. In the frequency domain, we can derive, for example, the spectral moments: centroid (brightness), bandwidth/spread (std dev around centroid), skewness (measure of the symmetry of the distribution), and kurtosis (peakiness). Also, we can find the spectral tilt (least squares linear fit) and roll-off (the frequency below which N% of the spectral energy is contained). The collection of the features in the frame is referred to as the feature vector. Then, perform some analysis on this information and make a decision about the content (classifying/clustering). On a simple level, a single feature can simply be sorted - eg, zero-crossing rate orders sounds roughly from pitched to noisy.
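A minimal numpy sketch of two of these frame-level features (function names are mine, not from the workshop):

```python
import numpy as np

def zero_crossing_rate(frame):
    # fraction of adjacent-sample pairs whose signs differ
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def spectral_centroid(frame, sr):
    # magnitude-weighted mean frequency of the frame's spectrum ("brightness")
    mags = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mags) / (np.sum(mags) + 1e-10)
```

For a pure sine at f Hz, ZCR comes out near 2f/sr and the centroid near f, which is a handy sanity check.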
Any supervised learning algorithm relies on a training phase, in which the human building the system gives the algorithm a set of data as a clue to what it's supposed to learn (eg, positive and negative examples for binary classification). In many cases, you have to pre-package the classifier - eg, you can't always use data in the set in question as training data. Each data point is a vector, eg of features. We need some notion of similarity - maybe the Euclidean distance between two feature vectors - so that if a point is close to other data points of a class, we can call it that class. k-NN finds the k nearest points in the training set, then majority rules (so k is normally odd, often chosen to be near the square root of N, though the choice does matter). It is very simple to train (just store the values), but classification gets expensive with a lot of training data because so many Euclidean distance measures in high-dimensional space require a lot of computation. It can also be "overfit", meaning it can't generalize to new values it hasn't seen. Euclidean distance is also very sensitive to the scaling of the feature vector values, so you must normalize (either directly or via a normal distribution) if you want all of your features to be equally weighted.
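A minimal k-NN classifier in numpy, along the lines described above (the function name is my own):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    # label x by majority vote among its k nearest training points
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every point
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[counts.argmax()]
```

Note that all the work happens at query time, matching the "simple to train, expensive to classify" trade-off above.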
Rise time or attack time is the time between the onset (basically a change of sign, or the low point before the max) and the maximal amplitude point, based on the envelope of the signal. It is more common to take log(attack time). It measures percussiveness. It also allows us to determine the attack slope (the slope between minimum and max). Temporal centroid tells us how much information is concentrated at the beginning of the frame vs. the end of the frame - it measures impulsiveness or decay amount/type. Similar to attack time, but also sensitive to what happens after the attack.
Energy per (third-)octave band, which is informed by human hearing. MFCCs try to match human perception more closely. Good for identifying different phonemes, and also timbre. Calculated by taking the FFT (or STFT) magnitude, applying a mel-spaced filterbank to group and smooth the bins, taking the log of the band energies, and then decorrelating with the DCT (which is all real, so you don't have to deal with phase). Using 12 coefficients is the standard, or 13 including the "zeroth coefficient", which is the overall energy of the frame. Also helpful for instrument recognition as it creates a "template" for the instrument. Delta MFCCs and delta-delta MFCCs are calculated to see how fast the MFCCs are changing, and how quickly that change itself is changing. Spectral flux is the distance between the spectra of successive frames, which is a one-dimensional feature. The spectral flatness measure is a sub-band-based measure determining the number of peaks, and their size, in the band. Perceptually it tells you how "tone-like" a sound is - how many peaks there are. Used in fingerprinting algorithms ("tonality coefficient"); robust to artifacts like compression, reverb, etc. Spectral crest factor is the ratio between the highest peak detected and the RMS value of the spectrum.
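A rough numpy/scipy sketch of the MFCC pipeline described above. The filterbank details are simplified, and 26 triangular filters / 13 coefficients are common but arbitrary choices:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    # 1. magnitude spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    n_bins = len(spectrum)
    # 2. triangular filters spaced evenly on the mel scale, 0 Hz to Nyquist
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bin_idx = np.floor((n_bins - 1) * mel_to_hz(mel_points) / (sr / 2)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = bin_idx[i], bin_idx[i + 1], bin_idx[i + 2]
        if mid > lo:
            fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / (mid - lo)
        if hi > mid:
            fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / (hi - mid)
    # 3. log of the mel band energies, then 4. DCT to decorrelate
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, type=2, norm='ortho')[:n_coeffs]
```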
The features you choose should depend on your domain knowledge. Choosing discriminating features is better - a smaller feature set is better: you get simpler models, faster training, and you need less training data.
We can divide the spectrum arbitrarily according to perception. By taking the log of frequency, an octave becomes a constant distance throughout the whole spectrum, which matches our notion of an octave being an equivalent musical interval regardless of frequency. We can also describe the process of note intervals as a chroma spiral, with the height corresponding to pitch. We can then look at chroma bins as the amount of C, C#, D, etc. in the signal, called the chromagram. We can then cross-correlate the chromagram with a template for each key, and pick the key whose template gives the maximum correlation.
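A sketch of chroma folding plus template matching, using the Krumhansl-Kessler major-key profile as the template (the profile and function names are my additions, and only major keys are handled here):

```python
import numpy as np

# Krumhansl-Kessler major-key profile, rooted at C (index 0)
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def chromagram(freqs, mags, c_ref=261.63):
    # fold spectral energy into 12 pitch-class bins (C = bin 0)
    chroma = np.zeros(12)
    valid = freqs > 20.0  # ignore DC / sub-audio bins
    pc = np.round(12 * np.log2(freqs[valid] / c_ref)).astype(int) % 12
    np.add.at(chroma, pc, mags[valid])
    return chroma

def estimate_key(chroma):
    # correlate the chroma vector against all 12 rotations of the profile
    scores = [chroma @ np.roll(MAJOR_PROFILE, k) for k in range(12)]
    return int(np.argmax(scores))  # 0 = C major, 1 = C# major, ...
```

Feeding in a spectrum containing only C, E and G should come out as C major.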
Make decisions based on thresholded levels of features - a decision tree algorithm: if feature 1 is less than x, predict one class; otherwise, if feature 2 is greater than y, predict another, and so on. The tree is fit recursively on the training data, which can lead to overfitting. A stump only uses a single decision. Common algorithms are CART, ID3, C4.5 and J48. It's easy to implement and the decision boundary is explicit and straightforward. However, it can take a long time to learn, and finding an optimal tree can be NP-complete. Slight perturbations of the data can also lead to very different trees.
A meta-algorithm for creating strong learners (robust; able to classify data with great accuracy) from many weak learners (not robust; "base" learners). The most well-known and used is AdaBoost, which is resistant to overfitting and tries to maximize the margin. You don't want the base learners to take a long time to learn, and you want them to perform better than chance. The algorithm starts with D1, the dataset with all examples equally weighted; then, for t from 1 to T, train the weak learner ht on the weighted dataset Dt; if ht can't achieve better than 50% accuracy, stop; otherwise choose alpha_t according to the error rate of ht on Dt; then update the data weights Dt+1 to increase the weight of examples ht got wrong and decrease the weight of examples ht got right. In other words, in each iteration, the highly weighted data points are the points the previous weak learners have gotten wrong. The final classifier is the set of weak classifiers ht, each weighted by its alpha_t. Each classifier is generated at each step based on the dataset and its weighting, and as long as they're better than 50%, you're all set. The base learners just have to support the ability to weight each data point; they could be decision stumps, pruned decision trees, circles or hyperspheres, etc. Maximizing the margin means there is no bias towards classifying something as positive or negative (the boundary is equally far from the closest positive and negative examples). AdaBoost is nice because it keeps maximizing the margin even after the training error reaches zero, without starting to overfit. It's also conceptually simple, and works with any base learner with no parameters to tune. However, the weak learner must achieve >50% accuracy, and the algorithm requires modification for non-binary classification.
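The loop above can be sketched with decision stumps as the weak learners. This is a simplified AdaBoost assuming labels in {-1, +1}; the function names are mine:

```python
import numpy as np

def train_stump(X, y, w):
    # weak learner: the single-feature threshold rule with lowest weighted error
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)  # D1: all examples equally weighted
    learners = []
    for _ in range(T):
        j, t, pol, err = train_stump(X, y, w)
        if err >= 0.5:
            break  # weak learner is no better than chance
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)  # upweight mistakes, downweight correct
        w /= w.sum()
        learners.append((alpha, j, t, pol))
    return learners

def boosted_predict(learners, X):
    # weighted vote of all weak classifiers
    score = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, p in learners)
    return np.sign(score)
```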
At the simplest level, you have some held-out test set; you run your classifier on it and determine how many times it gets the classification correct. You can also get a feel for this from your training data by using cross-validation. For 10-fold cross-validation, you divide the training data into 10 random subsets, and each subset is tested using the classifier trained on the remaining nine. This is repeated for each segmentation of the data, and the cross-validation score is the average of all the accuracies. This is helpful, but if you optimize too much on cross-validation, you can severely overfit. It can be used to determine which parameters to use in the classifier, as well as which features. Stratified cross-validation is similar, except that the folds are chosen so that they contain equal proportions of labels for the data points.
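A sketch of k-fold cross-validation in numpy. Here `train_fn` and `predict_fn` are hypothetical callables standing in for whatever classifier you're evaluating:

```python
import numpy as np

def kfold_accuracy(X, y, train_fn, predict_fn, k=10, seed=0):
    # shuffle the indices, split into k folds, and hold each fold out in turn
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        accs.append(np.mean(predict_fn(model, X[test]) == y[test]))
    return float(np.mean(accs))  # average accuracy over all k folds
```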
Unsupervised learning - clustering - finds pockets of data and groups them together, as compared to supervised learning, where a human assigns labels. K-means groups data into clusters according to squared distance. The aim is to divide the data into groups such that all the points in a group are as close as possible to each other while being as far as possible from the other groups. "Hard" clustering means that each point gets assigned rigidly to only one group (it can't be a mix of groups). The cluster centers are randomly placed, and the points are clustered based on which center they are closest to. Then, the mean of all the points in a cluster is found, and the center is moved to that spot. Then, the points are re-clustered based on their closest center. This continues until the cluster centers move only a slight amount. The random initial locations DO matter, because the algorithm can converge faster or slower (and to different clusterings) depending on them. To deal with this, we can run the k-means algorithm multiple times and use the clustering that minimizes the variance. The algorithm is guaranteed to converge eventually. You can also choose initial centers based on data points with extreme values, or by finding the mean of the whole data set and perturbing it into k means.
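The assign/re-center loop above can be sketched in a few lines of numpy (the optional `init` argument lets you supply initial centers instead of random ones, matching the point that initialization matters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0, init=None):
    rng = np.random.default_rng(seed)
    # random initial centers drawn from the data, unless supplied
    centers = X[rng.choice(len(X), k, replace=False)] if init is None else init.copy()
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break  # centers barely moved: converged
        centers = new
    return centers, labels
```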
Compact representations of joint probability distributions, and tools for reasoning about conditional independence and dependence - a combination of graph theory and probability theory. Markov random fields are undirected; Bayesian belief networks are directed and more common in MIR. The ideas come from Bayesian statistics (encoding your knowledge of how likely things are to happen), used to propose a model that could explain a real-world phenomenon. Using a graphical representation is more efficient than writing out the entire factorization of all joint probabilities. There are also simple graphical structures which pop up in various domains.
Random variables are functions that map events to numbers - eg, drawing a card to a binary 1 or 0. They're represented by a letter with a circle around it in graphical models. Distributions assign probability to regions of a sample space. For a normally distributed RV Z, we can write Z ~ N(mean, std. dev). The probability is then the area under the distribution curve. Common distributions are Gaussian/normal, binomial, beta, gamma, multinomial, Bernoulli, Dirichlet, and uniform. Any distribution is specified by its type and its parameters. There may be a "prior" distribution over a distribution's parameters. Distributions are not explicitly represented in the graphical notation, but will be described in the text; parameters are represented by a dot with an arrow pointing to an RV. Outcomes are not represented in the graph, and are normally written as lowercase. Conditional distributions give the probability of an event X when the value of another RV Y is known. Conditional relationships are represented as arrows pointing from one RV to another. If RVs are independent, then the occurrence of one event does not alter the probability of the other event - so the probability of both is the product of the events' probabilities. If they're dependent, knowledge of one changes the probability of the other. Independence can change when additional events are added - if something is observed, it can mean that two variables are no longer dependent. Observed variables whose values are known are shaded.
The naive Bayes assumption is a generative model, where a class C is drawn from a prior distribution, and each feature is drawn from a distribution conditional on C, with the features independent of one another: P(F, G, H, … | C) = P(F | C)P(G | C)P(H | C)… This rarely matches reality, but it is one way of representing how data came to be in a probabilistic framework. It is drawn as a single class C with arrows leading to a bunch of separate variables - sometimes the separate variables are represented as a "plate", where you have Fn with a box around it with an N in it. To classify, we look at the observed features and the model parameters, which involves knowing the distributions or finding the parameters which maximize the likelihood of the training data.
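A minimal Gaussian naive Bayes sketch of that factorization: each feature gets an independent per-class Gaussian, and classification sums the per-feature log-likelihoods plus the log prior (function names are my own):

```python
import numpy as np

def fit_gnb(X, y):
    # per-class prior plus per-feature Gaussian mean/variance
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return model

def predict_gnb(model, X):
    classes = sorted(model)
    scores = []
    for c in classes:
        prior, mu, var = model[c]
        # log P(c) + sum_i log N(x_i | mu_i, var_i): features independent given c
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        scores.append(np.log(prior) + ll)
    return np.array(classes)[np.argmax(scores, axis=0)]
```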
Gaussian mixture models are like k-means in that you create clusters, but the clusters are modeled as having been generated by normal distributions. The variable determining which component generated each point is latent/implicit - it's not observed.
Hidden Markov models are similar in that they have a latent variable, but the latent variable encodes some notion of state. You can infer the most likely state sequence, or predict the most likely next observation. The word "hidden" comes from the fact that the state variables are not explicitly observed in the real world. Learning means finding parameters which maximize the likelihood of the data. Often you maximize the log likelihood, because multiplication of terms changes to summation. For simple models like naive Bayes, maximum likelihood estimation is simple, but it has to be re-derived for new model architectures. Inference methods can be computationally intensive or get stuck in local maxima.
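The most likely state sequence for an HMM is found with the Viterbi algorithm; here is a sketch in log space, so the products mentioned above become sums (names and shapes are my own conventions):

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    # log_pi: (S,) initial state log-probs; log_A: (S, S) transition log-probs;
    # log_B: (S, O) emission log-probs; obs: sequence of observation indices
    S = len(log_pi)
    T = len(obs)
    delta = log_pi + log_B[:, obs[0]]   # best log-prob of a path ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: come from state i, move to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # trace the best path backwards from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```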
To assign semantic tags to audio, build a model for the joint distribution of tags and audio features, then use that model to compute the probability that a tag applies to a certain song. This allows for annotation (new tagging) or retrieval (finding appropriate songs). Starting with a 39-dimensional MFCC-delta feature vector, they vector-quantize all feature vectors in terms of codewords (how many times does a codeword appear in a song). For each song in a collection, and for each possible tag, draw a codeword from the song, draw a Bernoulli for that codeword and tag pair, and apply the tag if and only if the Bernoulli comes up true. The Bernoulli parameters come from training data.
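The vector-quantization step (counting how often each codeword appears in a song) can be sketched as follows, assuming a codebook has already been learned, eg by k-means:

```python
import numpy as np

def codeword_histogram(features, codebook):
    # quantize each frame's feature vector to its nearest codeword,
    # then count how often each codeword appears in the song
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()
```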
Some provide audio databases, hopefully with some metadata about instruments, etc. Others just provide features, to avoid copyright problems.
K-means results in hard clusters - each data point must fall into one cluster or another. GMMs give soft clusters, so a point can be a mix of clusters. You specify how many components (centers) there are, and what is returned includes the variance around the centers. GMMs are good because they can approximate any probability distribution with enough components, they compress data, and they allow us to make "softer" decisions later on, via whatever method. In addition to specifying the number of components, you tell it roughly the importance of each cluster (the mixture coefficients or priors), along with initial centers and variances. It returns the centers/means, variances, mixture coefficients, and posterior probabilities (the responsibility which each Gaussian component takes for each data point). Once you have a GMM, you have a description of a probability distribution in a feature space, so you can take any new feature vector and easily calculate the probability that it belongs to one mixture component or another. Each GMM is "per-class", eg per-genre or per-artist, and it basically describes (via mixtures of Gaussians) the region of feature space that the class inhabits. The variance can be calculated in a variety of "shapes": spherical covariance - one number for the variance across all features; a diagonal covariance matrix - where the variance for each feature dimension is calculated independently; full covariance - where you look at how each axis varies with each other; or pooled covariance - where a single covariance is used to describe all clusters. Diagonal covariance has the best trade-off of number of components vs. compute time.
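A sketch of EM for a diagonal-covariance GMM, returning the means, variances, mixture coefficients, and responsibilities described above (the `init_means` argument and function name are my own; a real implementation would also check log-likelihood convergence):

```python
import numpy as np

def fit_gmm_diag(X, k, n_iter=50, seed=0, init_means=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = (X[rng.choice(n, k, replace=False)] if init_means is None
             else init_means.astype(float).copy())
    var = np.tile(X.var(axis=0), (k, 1)) + 1e-6
    pi = np.full(k, 1.0 / k)  # mixture coefficients (priors)
    for _ in range(n_iter):
        # E-step: responsibilities = posterior probability of each component
        logp = (np.log(pi)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * ((X[:, None, :] - means) ** 2 / var).sum(axis=2))
        logp -= logp.max(axis=1, keepdims=True)  # for numerical stability
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, variances from responsibilities
        nk = resp.sum(axis=0) + 1e-10
        pi = nk / n
        means = (resp.T @ X) / nk[:, None]
        var = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return pi, means, var, resp
```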
Using good training examples to build classes is very important, as is a good feature vector. Choose a decision-making algorithm depending on what you want out of the system, and use cross-validation to evaluate its performance before touching your test data too much. It's also valuable to look at your data as much as possible - look at means, max, min, std, NaNs, Infs, etc.
Which line (or boundary or hyperplane) best separates a bunch of data? The margin is the maximum width of the band around the line before it hits a data point. The support vectors are the data points which push up against the margin (those on the edge). SVM training is guaranteed to find the global minimum of the cost function - trading off error against how important those errors are. This allows the SVM to generalize, as opposed to just fitting the input data. The separation line/plane need not be straight - it depends on the kernel. A radial basis function kernel lets you choose the degree of curviness (gamma) and the cost C (how heavily mis-classified points are penalized). To find the optimal gamma and C, you can use a grid search with cross-validation, first coarse, then fine. It is brute force, but it works well if you have the time (if you don't, you can sub-sample the data).
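The coarse-then-fine grid search can be sketched generically. Here `score_fn` is a hypothetical stand-in for cross-validated SVM accuracy at a given (gamma, C):

```python
import numpy as np
from itertools import product

def grid_search(score_fn, gammas, Cs):
    # evaluate score_fn over the full (gamma, C) grid; return the best pair
    return max(product(gammas, Cs), key=lambda p: score_fn(*p))

def refine_grid(g, c, factor=4.0, n=5):
    # finer log-spaced grids centered on the coarse winner, for a second pass
    return (np.geomspace(g / factor, g * factor, n),
            np.geomspace(c / factor, c * factor, n))
```

In practice you'd call `grid_search` once on wide log-spaced grids, then again on the output of `refine_grid`.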
Precision gives you an understanding of how relevant the retrieved results are: the number of true positives over the number of true positives + false positives (the total number of retrieved documents). In order to decide if something is a match (true or false positive), some kind of threshold must be set. Recall is how complete the retrieved results are: the number of true positives over the number of true positives + false negatives - a measure of how much you retrieved vs. how much you SHOULD have retrieved. If you care about precision and recall equally, you can calculate the F-measure, which is the harmonic mean of precision and recall: 2PR/(P+R). You may not always care about them equally - false negatives may be very important, for example. ROC ("receiver operating characteristic") curves are plots of false positive rate vs. true positive rate, so that more "northwest" is better. Anything below the diagonal is worse than random. Principal component analysis is an orthogonal linear transformation of the feature space into a new coordinate system. The first principal component points in the direction of the greatest variance, the second principal component the next-highest, etc. The directions do not depend on class; they are based entirely on the variance. In this way, you can rank the components by how well they capture the data. It assumes that there's a high SNR. Linear discriminant analysis reduces dimensionality while preserving structure useful for class discrimination. It projects onto a line which maximizes class separability. This is done by optimizing the difference of class means, normalized by a measure of within-class scatter. For LDA to work well, the underlying class distributions should be Gaussian and the discriminatory information of the classes should be in the means, not the variance.
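The precision/recall/F-measure definitions above, assuming binary labels in {0, 1}:

```python
import numpy as np

def precision_recall_f(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))  # retrieved and relevant
    fp = np.sum((y_pred == 1) & (y_true == 0))  # retrieved but irrelevant
    fn = np.sum((y_pred == 0) & (y_true == 1))  # relevant but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f
```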
One meta-method is to choose subsets of features to test, and pick the one with the best accuracy (eg, with cross-validation). The subset choice can also be done greedily. The difficulty with this method is that noise in the estimator may overwhelm real-world effects - the training accuracy (by cross-validation) is often different from test accuracy. Filter-based methods do not repeatedly train classifiers; instead, they look at the individual predictiveness/goodness of each feature, using measures like information gain. Feature selection can be very expensive, and searching through the whole feature space may not be useful.
If you measure the cross-correlation of features between pairs of frames and plot it as a matrix (a similarity matrix), you should see a checkerboard pattern with a granularity corresponding to locations of repetition, etc. Dark squares are "novel" regions. You can visualize key progression with triangular plots called keyscapes, which map how the key changes over the piece. At the bottom of the triangle, the per-frame key is shown, while at the top, the entire composition's key is visualized, as color.
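A self-similarity matrix can be sketched as cosine similarity between every pair of feature frames (any frame-level feature, eg MFCCs or chroma, works as input):

```python
import numpy as np

def self_similarity(frames):
    # frames: (n_frames, n_features); cosine similarity between all frame pairs
    normed = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-10)
    return normed @ normed.T  # symmetric, ~1.0 on the diagonal
```

Repeated sections show up as bright off-diagonal stripes/blocks; plotting the matrix gives the checkerboard described above.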