Proposes two beat tracking methods based on a neural network-generated beat activation function.
Beat tracking is easy for humans but hard for machines, but would be useful for synchronization or rhythmically-aware processing. Most machine-based beat trackers first extract some relevant perceptually-related features (onsets, chord changes, amplitude envelopes, spectral) from the signal, then attempt to estimate their periodicity (tempo) and phase (beat locations). Periodicity estimation is usually done with autocorrelation, comb filters, histograms, or multiple agents, some of which also provide a phase estimation.
Neural Networks (NNs) are used to model complex relationships between a set of inputs and a set of outputs. The most basic is the Multi-Layer Perceptron (MLP) which forms a Feed-forward Neural Network (FNN), where input values are fed through hidden layers consisting of neurons with non-linear activation functions, resulting in a causal system. In a Recurrent Neural Network (RNN), cyclic connections in the hidden layers are allowed, which may cause input values to decay or blow up exponentially over time. Long Short-Term Memory (LSTM) attempt to prevent this by adding recurrent connections and a “forget” gate to act as a memory cell (from Wikipedia: “Unlike traditional RNNs, an LSTM network is well-suited to learn from experience to classify and process and predict time series when there are very long time lags of unknown size between important events.”). To adapt these NNs to consider both past and future values, a fixed time window or a delay can be added to the input. Bidirectional Recurrent Neural Networks (BRNNs) add a second set of hidden layers which are used to process the signal backwards, so that the network has access to past and future values. Bidirectional Long Short-Term Memory (BLSTM) RNNs similarly add the second “backwards” set of hidden layers to the LSTM network and can model any temporal context around a given input value (making them appropriate for beat and onset detection).
To create a beat activation function, three parallel STFTs are calculated with different window lengths and their first-order difference is fed into the BLSTM network, whose output is fed into a peak detection function. The input signal is segmented into 23.2, 46.4, and 92.8 ms Hamming-windowed frames taken every 10 ms, and the power spectrum of each frame is found. Each spectrum is converted into a Mel spectrum with 20 triangular filters. Because large windows cause peaks in energy to occur before the actual beat, the first order median difference is calculated - that is, the half-wave rectified difference of the median spectrum over the last 410 ms and the current spectrum. Both the mel-spectrums and median difference spectrums of each frame size are used as inputs to the BLSTMRNN.
The network has three hidden layers in each direction, each with 25 LSTM units, and two outputs which represent the probability of “beat” vs. “no beat”. The network was trained using about 30 minutes of music with about 3600 hand-annotated beats. Gradient descent with backpropagation of the errors is used to train the network; once the performance on a 15% randomly-chosen validation set does not improve after 20 iterations, the training is stopped. Five different networks are trained in this way, and their outputs are averaged and used as the input to the peak detection stage. This output could be thresholded to find beats, but a more sophisticated approach where the tempo is assumed to be constant over a certain period of time can prevent false positive and negative beats.
Two different constant-tempo windows are used to create two beat trackers: The duration of the entire song for BeatDetector and six seconds for BeatTracker. The autocorrelation of the beat activation function over this window, with lags from 273 ms to 1.5 s (corresponding to tempi of 40 to 220 bpm), is calculated and smoothed using a Hamming window of size 150 ms. The tempo is chosen to be the highest peak in the smoothed autocorrelation function and its period is used as the beat interval. The first beat location is set to the highest value of the beat activation function's sum at all possible beat locations for the tempo interval. Each subsequent beat location is chosen as the highest value of the beat activation function at the possible beat positions for a given interval. The final beat function is set to 1 within 10% of a beat interval around each beat and 0 elsewhere.
This algorithm was submitted to the 2010 MIREX beat tracking evaluation and was found to be the best performing algorithm for a dataset consisting pieces with roughly constant tempo, with the BeatDetector performing slightly better than BeatTracker. In a dataset with widely varying tempo, the BeatTracker performed second best, and significantly better than BeatDetector. Notably neither algorithm was trained on a dataset with such widely varying tempo, but the BeatTracker still performed well in that context. The algorithm may be improved some by trying different peak picking techniques.