|Authors||Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris Duxbury, Mike Davies, and Mark B. Sandler|
|Publication Info||Retrieved here|
Discusses, evaluates, and improves upon a number of common onset detection techniques.
Music is fundamentally event-based, so automatically segmenting music by events allows for smarter processing and analysis. Notes can be thought of as having an attack portion, which is the initial time interval where the amplitude increases. A transient is any short interval where the signal evolves in a nontrivial/unpredictable way; transients should be at least 10 ms apart. A note's onset is the very beginning of the transient or attack. Music signals are typically polyphonic audio signals, which are additive and oscillatory, so changes in the signal are typically searched for in an intermediate signal (a detection function) that represents the local structure of the audio signal. The detection function is typically derived from a pre-processed audio signal, and a peak-picking algorithm is applied to it to locate the audio signal's onsets.
Preprocessing the audio signal attempts to accentuate or attenuate various aspects of the signal according to their relevance to onset detection. One typical preprocessing technique for onset detection involves splitting the signal into multiple bands. This may be needed in a multiple-agent architecture, or it may be used to process different frequency bands separately (potentially using different onset detection algorithms) and then combine the per-band results. The signal may also be preprocessed by separating its transient and steady-state components. When representing a signal as a sum of sinusoids, the residual is typically modeled as filtered Gaussian noise, and may reveal information about where the signal is acting unpredictably. The phase vocoder may be used to classify individual frequency bins by the predictability of their phase components. The dominant coefficients of the MDCT may also be treated as the tonal part of the signal.
Detection functions are normally calculated based on predefined signal features or probabilistic techniques.
A simple feature is the signal's amplitude envelope, which can be found by rectifying and low-pass filtering the signal. A closely related feature is the energy envelope, which squares rather than rectifies the signal. These features are typically not refined enough for peak picking, so the derivative of the envelope is commonly taken to accentuate sudden increases in energy. The first-order difference of the log of the energy may also be used in order to better mimic how humans perceive loudness.
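As a rough illustration of the temporal features described above, the following sketch computes an energy envelope and the first-order difference of its logarithm. The frame size, hop size, and the small floor used to avoid taking the log of zero are assumptions made for this example, and the frame-wise mean stands in for a proper low-pass filter.

```python
import math

def energy_envelope(samples, frame_size=4, hop=2):
    """Mean squared amplitude over sliding frames (a crude low-pass of x**2)."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        envelope.append(sum(x * x for x in frame) / frame_size)
    return envelope

def log_energy_difference(envelope, floor=1e-10):
    """First-order difference of log energy; `floor` (assumed) avoids log(0)."""
    logs = [math.log(max(e, floor)) for e in envelope]
    return [b - a for a, b in zip(logs, logs[1:])]

# Toy signal: eight samples of silence followed by a sudden burst.
samples = [0.0] * 8 + [1.0, -1.0] * 4
envelope = energy_envelope(samples)
detection = log_energy_difference(envelope)
# The largest value marks the frame transition where the energy first rises.
onset_frame = detection.index(max(detection))
```

With such small frames the example is only meant to show the shape of the computation; real audio would use frames of at least several milliseconds.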
Features based on the spectrum (STFT) of the signal are also commonly used. A simple measure which helps observe energy increases due to transients is the amount of high-frequency energy in the signal, which may be calculated by linearly weighting each bin's contribution in proportion to its frequency. This high frequency content (HFC) feature is best used with percussive onsets, where transients are well modeled as bursts of white noise. To model the temporal evolution of the signal, a piecewise linear approximation of the magnitude profile of the spectrum can be calculated, and the parameters of the approximation can be used to form a multi-band detection function. Similar approaches formulate the detection function based on the distance between successive spectra, either by taking the L1 norm of the difference between magnitude spectra or the L2 norm of the rectified difference. The phase may also be used by determining whether the instantaneous frequency remains approximately constant over adjacent windows; the phase deviation can be summed across frequency bins and used as the detection function. Both the phase and amplitude of successive spectra can be used by calculating the Euclidean distance between the observed spectrum and the spectrum predicted from the preceding frames.
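The magnitude-based spectral features above can be sketched as follows. This is a minimal illustration that assumes the STFT magnitudes have already been computed and are given as one list of bin magnitudes per frame; the function names are hypothetical, and weighting the squared magnitude by the bin index is one plausible reading of "linearly weighting each bin's contribution".

```python
import math

def high_frequency_content(magnitudes):
    """Energy of one frame with each bin weighted linearly by its index."""
    return sum(k * m * m for k, m in enumerate(magnitudes))

def l1_spectral_difference(previous, current):
    """L1 norm of the difference between successive magnitude spectra."""
    return sum(abs(c - p) for p, c in zip(previous, current))

def rectified_l2_spectral_difference(previous, current):
    """L2 norm of the half-wave rectified difference (only increases count)."""
    return math.sqrt(sum(max(c - p, 0.0) ** 2 for p, c in zip(previous, current)))

# Toy magnitude spectra: a steady low tone, a broadband burst, then decay.
frames = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.5, 0.5, 0.5],
]
hfc = [high_frequency_content(f) for f in frames]
l1 = [l1_spectral_difference(a, b) for a, b in zip(frames, frames[1:])]
l2 = [rectified_l2_spectral_difference(a, b) for a, b in zip(frames, frames[1:])]
```

In this toy input the rectified difference is zero for the decaying frame while the unrectified L1 difference is not, which shows why rectification helps a detection function respond to energy increases rather than decreases.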
Time-frequency representations (TFRs) are also used to derive the detection function. One approach finds the dissimilarity between successive feature vectors calculated from a discretized Cohen's class TFR, which is obtained by convolving the Wigner-Ville TFR of the signal with a Gaussian kernel. The dyadic wavelet decomposition of the signal's residual can also be used, and offers good time localization. The significance of the transform's coefficients can be quantified by a regularity modulus, which can act as an onset detection function.
The deviation of a probability model from the true characteristics of the audio signal can represent the novelty of the signal. One approach uses the sequential probability ratio test, which assumes that the signal samples are generated from one of two models and calculates the log-likelihood ratio of the probability density functions associated with each model. If the short-time average of the log-likelihood ratio changes sign, then the signal has switched probability models. When the models are unknown, one model can be fitted to a growing window starting from the last detection point and the other can be estimated from a window of fixed size, both using Gaussian autoregressive models. The fixed-length window can also be divided into two windows about a change point, with each of the windows modeled using a separate Gaussian autoregressive model. Change points are then found when the log-likelihood of a change at the split point vs. no change passes a threshold. Rather than looking for a change between models, the negative log probability of the signal given its recent history (under a global model) can be used as a detection function. In this way, the signal can be considered as a multivariate random process where each variable is a vector of previous audio samples. The expectations about a vector can be expressed as a probability model based on the previous frames, and the “surprise” at observing a new frame can be defined as the negative log of the probability of observing the new frame given the previous frames. The surprise may also be defined in terms of the conditional probability of one segment of the frame given the other segment, which is obtained from the probability of the other segment and the probability of the frame itself; both can be estimated using a joint density model such as two separate independent component analysis (ICA) models. In ICA, the signal frame is assumed to be generated by a linear transformation of a random vector of independent non-Gaussian components.
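The log-likelihood ratio idea can be illustrated with a deliberately simplified sketch. It assumes two known models, each a Gaussian with a fixed mean and variance applied to independent samples, rather than the Gaussian autoregressive models mentioned above; the window length and function names are also assumptions for the example.

```python
import math

def gaussian_log_pdf(x, mean, variance):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

def log_likelihood_ratio(samples, model_a, model_b):
    """Per-sample log p_b(x) - log p_a(x); positive values favour model B."""
    return [
        gaussian_log_pdf(x, *model_b) - gaussian_log_pdf(x, *model_a)
        for x in samples
    ]

def sign_changes(llr, window=4):
    """Window start indices where the short-time average of the LLR changes sign."""
    averages = [sum(llr[i:i + window]) / window
                for i in range(len(llr) - window + 1)]
    changes = []
    for i in range(1, len(averages)):
        if (averages[i - 1] < 0) != (averages[i] < 0):
            changes.append(i)
    return changes

# Quiet segment (well described by model A) followed by a loud segment (model B).
samples = [0.1, -0.1] * 4 + [2.0, -2.0] * 4
model_a = (0.0, 0.01)  # (mean, variance), assumed known
model_b = (0.0, 4.0)
llr = log_likelihood_ratio(samples, model_a, model_b)
changes = sign_changes(llr)  # → [5]
```

The reported index is the start of the first averaging window that contains the loud segment (samples 5 to 8), so the detected change slightly precedes the true switch at sample 8; the offset depends on the window length.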
Some of these probability models simplify to signal feature models given certain assumptions about the probability distributions.
Temporal methods often fail when faced with amplitude-modulated or overlapping sounds. The high frequency content metric can emphasize the percussiveness of a signal but is less robust with low-pitched and non-percussive events and when onsets are masked by high-frequency sounds. Phase-based techniques have difficulty with complex musical recordings whose phase may be distorted. Time-frequency approaches can have much better time resolution but may be much less smooth, requiring post-processing. Probability-based models can provide a more theoretically sound detection function but require expensive and time-consuming training.
The onset detection function should give rise to a signal with local maxima at possible onset locations but will likely have a good deal of noise, so an algorithm must be used to differentiate maxima corresponding to onsets from spurious ones. Some detection functions are post-processed to make thresholding and peak picking easier, usually by smoothing, normalizing, and removing any DC component. Thresholding is done to ignore maxima which are not substantial enough to suggest an onset. A fixed threshold cannot adapt to slow changes in the audio signal's loudness, so an adaptive threshold is normally used, calculated as a smoothed version of the detection function. This smoothing may be done with a linear FIR filter, by squaring and applying a smoothing window, or with a sliding median filter. Peak picking can be done in a large number of ways, but depending on the post-processing it can be as simple as choosing all maxima above the threshold.
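The thresholding and peak-picking step can be sketched as below, using a sliding median as the adaptive threshold and keeping every local maximum that exceeds it. The offset `delta`, the weight `scale`, and the window half-width are hypothetical parameters chosen for the example, not values taken from the paper.

```python
import statistics

def adaptive_threshold(detection, half_window=2, delta=0.1, scale=1.0):
    """Sliding-median threshold: delta + scale * median of nearby values."""
    thresholds = []
    for n in range(len(detection)):
        lo = max(0, n - half_window)
        hi = min(len(detection), n + half_window + 1)
        thresholds.append(delta + scale * statistics.median(detection[lo:hi]))
    return thresholds

def pick_peaks(detection, thresholds):
    """Indices of local maxima that lie above the adaptive threshold."""
    peaks = []
    for n in range(1, len(detection) - 1):
        is_local_max = detection[n - 1] < detection[n] >= detection[n + 1]
        if is_local_max and detection[n] > thresholds[n]:
            peaks.append(n)
    return peaks

# Toy detection function with two strong peaks and two small spurious bumps.
detection = [0.0, 0.1, 0.0, 0.9, 0.1, 0.0, 0.15, 0.0, 1.0, 0.2, 0.0]
thresholds = adaptive_threshold(detection)
onsets = pick_peaks(detection, thresholds)  # → [3, 8]
```

Raising `delta` trades missed onsets for fewer false detections, which is the kind of threshold adjustment an evaluation can sweep over.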
The HFC, spectral difference, phase deviation distribution spread, wavelet regularity modulus, and negative log-likelihood detection functions were compared by calculating each function on a dataset of music (split into pitched non-percussive, pitched percussive, non-pitched percussive, and complex mix) with labeled onsets. Onsets were derived using a median-filter-based thresholding and peak-picking technique.
By fixing all parameters while adjusting the peak-picking threshold, it was found that the negative log-likelihood feature performed the best, with the HFC feature having only slightly fewer true positives and more false positives. The HFC feature also fared poorly with pitched non-percussive sounds, while the negative log-likelihood feature performed reasonably well with most audio types. The results may have been different with a different dataset or if the “best” peak-picking algorithm had been used for each detection function.
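Counting true and false positives against labeled onsets requires some rule for deciding when a detection matches a label. The sketch below assumes a greedy one-to-one matching within a fixed tolerance; the 50 ms default and the function name are assumptions for illustration, not details given in this summary.

```python
def count_matches(detected, labeled, tolerance=0.05):
    """Greedy one-to-one matching of onset times in seconds.

    Returns (true_positives, false_positives, false_negatives).
    """
    detected = sorted(detected)
    labeled = sorted(labeled)
    i = j = true_positives = 0
    while i < len(detected) and j < len(labeled):
        if abs(detected[i] - labeled[j]) <= tolerance:
            true_positives += 1
            i += 1
            j += 1
        elif detected[i] < labeled[j]:
            i += 1  # detection with no label nearby: a false positive
        else:
            j += 1  # label with no detection nearby: a missed onset
    false_positives = len(detected) - true_positives
    false_negatives = len(labeled) - true_positives
    return true_positives, false_positives, false_negatives

# Hypothetical detections and labels for one excerpt.
detected = [0.51, 1.20, 2.02, 3.50]
labeled = [0.50, 2.00, 3.00]
result = count_matches(detected, labeled)  # → (2, 2, 1)
```

Repeating this count while sweeping the peak-picking threshold yields one true-positive/false-positive pair per threshold value, which is how detection functions can be compared across operating points.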