Context: Columbia Neural Network Reading Group/Seminar Series
Music is complicated: there is complex structure you're trying to understand from the audio. This often results in very large, complex models, which benefit from lots of data to avoid overfitting. Unfortunately, there isn't much data for many of the kinds of low-level information we're trying to extract from music; large-scale, well-annotated datasets are hard to come by.
One way to gain this data is to synthetically create many “views” of the same concept. Data augmentation artificially creates additional training data from the available data such that the label is preserved: desaturating, overexposing, or rotating an image, for example, won't change a label describing which objects are present in it. The test data remains unchanged, and augmentation typically helps performance.
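A minimal sketch of label-preserving augmentation for audio: one labeled clip becomes several training examples at different signal-to-noise ratios, all carrying the original label. (The `add_noise` helper and the "violin" label are illustrative, not from any particular library.)

```python
import numpy as np

def add_noise(audio, snr_db, rng=None):
    """Return a noisy copy of `audio` at the given signal-to-noise ratio (dB).

    The label attached to the clip is unchanged: adding background noise
    does not alter which instruments are present.
    """
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# One labeled clip becomes several training examples with the same label.
clip = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s of A4 at 22.05 kHz
augmented = [(add_noise(clip, snr), "violin") for snr in (20, 10, 5)]
```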
When deforming music, the label may change: pitch shifting the audio may change chord or key labels, and time stretching moves annotation boundaries for chords, onsets, beats, etc. Adding noise likely won't change any labels. So applying data augmentation isn't always straightforward: you need to be careful to change both the original data and, potentially, the labels.
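To make the label-updating point concrete, here is a toy sketch of how a chord label would have to be transposed alongside a pitch-shifted recording. Real chord vocabularies (e.g. Harte notation, with flats and inversions) are much richer; this simplified `transpose_chord` helper is an assumption for illustration only.

```python
# When the audio is pitch-shifted by n semitones, chord labels must be
# transposed by the same amount, while a label such as an instrument tag
# would be left alone.
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose_chord(label, n_semitones):
    """Transpose a chord label like 'A:min' by n semitones (sharps only)."""
    root, _, quality = label.partition(":")
    idx = (PITCHES.index(root) + n_semitones) % 12
    return PITCHES[idx] + (":" + quality if quality else "")

transpose_chord("A:min", 2)   # -> "B:min"
transpose_chord("G", -1)      # -> "F#"
```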
JAMS is the “JSON Annotated Music Specification”, a JSON schema that defines how annotations are encoded. It makes it simple to package a bunch of information about a song in one place. The original version of JAMS lacked a unified, cross-task interface. In the most recent version, the files are “annotation major”: every annotation has exactly the same structure, with observations consisting of “time”, “duration”, “value”, and “confidence” fields. Each task has a schema, which allows a given annotation to be validated, and each task can have a different namespace for different annotation styles.
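A sketch of the “annotation major” layout: every observation, regardless of task, carries the same four fields. The namespace names below ("chord", "beat") follow JAMS conventions, but this hand-built dictionary is a simplified illustration, not the official JAMS schema or API.

```python
# Two annotations for different tasks, sharing one observation structure.
jam = {
    "file_metadata": {"title": "example", "duration": 30.0},
    "annotations": [
        {
            "namespace": "chord",
            "data": [
                {"time": 0.0, "duration": 2.0, "value": "A:min", "confidence": 1.0},
                {"time": 2.0, "duration": 2.0, "value": "E:7", "confidence": 1.0},
            ],
        },
        {
            "namespace": "beat",
            "data": [
                {"time": t / 2, "duration": 0.0, "value": i + 1, "confidence": None}
                for i, t in enumerate(range(8))
            ],
        },
    ],
}

# The cross-task interface: every observation has exactly these fields.
REQUIRED = {"time", "duration", "value", "confidence"}
assert all(
    set(obs) == REQUIRED
    for ann in jam["annotations"]
    for obs in ann["data"]
)
```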
Given an input JAMS file with a corresponding audio file, a “deformation” object takes on a set of states corresponding to the deformations it will apply. For each state, it applies the augmentation to the audio, updates the annotations, and outputs a new JAMS file. Each deformation object knows which namespaces (tasks) it deforms annotations in, and deformations are composable. The state of the deformation object used to produce each output is also stored in the JAMS file, so every deformed example is reproducible. “Pipelines” compose deformations, which yields combinatorially many deformed outputs.
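A minimal sketch of this design, assuming a plain-Python model of it rather than the actual library API: each deformer enumerates its states, applies each one to a copy of the JAMS data, records its state for provenance, and a pipeline takes the cross product of its steps' states. The `apply` bodies below only record toy fields (they don't resample audio).

```python
import copy

class Deformer:
    """One deformation: enumerates states and yields one deformed JAMS per state."""
    namespaces = set()  # which annotation tasks this deformer touches

    def states(self):
        raise NotImplementedError

    def apply(self, jam, state):
        raise NotImplementedError

    def transform(self, jam):
        for state in self.states():
            out = copy.deepcopy(jam)
            self.apply(out, state)
            # Store the state used, so each output is reproducible.
            out.setdefault("sandbox", []).append(
                {"deformer": type(self).__name__, "state": state})
            yield out

class PitchShift(Deformer):
    namespaces = {"chord"}
    def __init__(self, semitones):
        self.semitones = semitones
    def states(self):
        return [{"n": n} for n in self.semitones]
    def apply(self, jam, state):
        jam["pitch_offset"] = jam.get("pitch_offset", 0) + state["n"]

class TimeStretch(Deformer):
    namespaces = {"chord", "beat", "onset"}
    def __init__(self, rates):
        self.rates = rates
    def states(self):
        return [{"rate": r} for r in self.rates]
    def apply(self, jam, state):
        jam["rate"] = jam.get("rate", 1.0) * state["rate"]

class Pipeline:
    """Compose deformers: the output count is the product of their state counts."""
    def __init__(self, *steps):
        self.steps = steps
    def transform(self, jam):
        jams = [jam]
        for step in self.steps:
            jams = [out for j in jams for out in step.transform(j)]
        return jams

pipe = Pipeline(PitchShift([-1, 0, 1]), TimeStretch([0.9, 1.1]))
outputs = pipe.transform({"annotations": []})
len(outputs)  # 3 pitch states x 2 rates = 6 deformed JAMS
```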
This approach was tested on an instrument recognition task, on the MedleyDB dataset. A convolutional network, with two convolutional layers and two max-pooling layers with a sigmoidal output, was trained to predict whether an instrument was present in one-second snippets from MedleyDB. The data was split into 15 artist-conditional random splits, with a 4:1 train/test ratio. Combinations of pitch shift, time stretch, background noise, and dynamic range compression augmentations were applied. Pitch shifting appears to help a lot, but adding the other augmentations on top of it didn't have a large additional benefit. Looking at the change in F1-score per instrument, augmentation appears to help in general, except for synthesizer, female singer, and violin. The synthesizer is not a well-defined instrument class, and for female singer and violin, time stretching likely distorts vibrato enough to hurt accuracy.
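Artist-conditional splitting matters here because the same artist's tracks are highly correlated, so letting an artist span train and test would leak information. A sketch of one such split, assuming a simple artist-level 4:1 partition (the paper's exact splitting procedure is not specified in these notes):

```python
import random

def artist_conditional_split(tracks, train_frac=0.8, seed=0):
    """Split tracks so that no artist appears in both train and test.

    `tracks` is a list of (track_id, artist) pairs; the 4:1 ratio is
    applied at the artist level. Hypothetical helper for illustration.
    """
    artists = sorted({a for _, a in tracks})
    rng = random.Random(seed)
    rng.shuffle(artists)
    cut = int(len(artists) * train_frac)
    train_artists = set(artists[:cut])
    train = [t for t, a in tracks if a in train_artists]
    test = [t for t, a in tracks if a not in train_artists]
    return train, test

# Repeating this with different seeds gives the 15 random splits.
tracks = [("t1", "A"), ("t2", "A"), ("t3", "B"),
          ("t4", "C"), ("t5", "D"), ("t6", "E")]
train, test = artist_conditional_split(tracks)
```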