|Authors||Kris West and Stephen Cox|
|Publication Info||Retrieved here|
Compares different segmentation techniques for classifying audio; proposes an onset-based technique.
There is no unified frame extraction technique for music classification. Common approaches include many short frames (overlapping or not), windowed short frames, and a single feature vector per file. Averaging all short frames across a file oversimplifies the representation, while keeping every frame yields a feature space that is difficult to model. Long sliding windows risk modeling many distinct musical events with a single feature vector. Instead, it might be smart to segment the signal as humans do, in terms of discrete events (via onset detection).
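The overlapping, windowed short-frame scheme mentioned above can be sketched as follows (frame length and hop size are illustrative, not the paper's values):

```python
import numpy as np

def extract_frames(signal, frame_len, hop):
    """Slice a 1-D signal into overlapping, Hamming-windowed frames.

    A generic sketch of short-frame extraction; frame_len and hop are
    hypothetical parameters, not taken from the paper.
    """
    window = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] * window for s in starts])
```

Each downstream feature (MFCC-like or spectral-contrast) would then be computed per frame, and the segmentation question becomes how to combine these frame vectors.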
Octave-based spectral contrast features are used. These resemble MFCCs, except that octave-scale filters replace the mel-scale filter bank, and instead of summing the energy in each band, the spectral contrast (the difference between spectral peaks and valleys) is estimated, which better separates harmonic and non-harmonic components.
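A minimal sketch of the spectral-contrast idea, assuming a common formulation where contrast in each octave band is the log difference between the strongest and weakest magnitudes (band edges, `f_min`, and the peak/valley fraction `alpha` are my assumptions, not the paper's exact parameters):

```python
import numpy as np

def octave_spectral_contrast(frame, sr, f_min=100.0, n_bands=6, alpha=0.02):
    """Sketch of octave-based spectral contrast (illustrative parameters).

    For each octave-wide band, contrast is the log difference between the
    mean of the largest and smallest alpha-fraction of magnitudes, roughly
    separating harmonic (peak) from non-harmonic (valley) energy.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    contrasts = []
    lo = f_min
    for _ in range(n_bands):
        hi = lo * 2.0  # octave-wide band
        band = np.sort(spectrum[(freqs >= lo) & (freqs < hi)])
        if len(band) == 0:
            contrasts.append(0.0)
        else:
            k = max(1, int(alpha * len(band)))  # bins treated as peak/valley
            valley = np.log(band[:k].mean() + 1e-10)
            peak = np.log(band[-k:].mean() + 1e-10)
            contrasts.append(peak - valley)
        lo = hi
    return np.array(contrasts)
```

For a tonal signal the peak term dominates and contrast is large; for noise-like bands peak and valley magnitudes are similar and contrast is small, which is why the feature distinguishes harmonic from non-harmonic content better than a plain band-energy sum.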
All segmentations involve some combination of features computed over overlapping, Hamming-windowed frames. Plain (non-combined) 23 ms and 200 ms segmentations are used, as well as the mean and variance of 20 ms frames over 1-second sliding windows, 10-second non-overlapping windows, and onset-bounded windows. Onset detection functions based on the frame-to-frame change in FFT and mel-band magnitudes are used, as well as one which considers both magnitude and phase. Onset locations are obtained by comparing an onset detection function against a dynamic threshold derived from its moving median. The moving-median window size and the onset isolation window size were optimized against a ground-truth transcription of onset times in a number of pieces. The functions based on energy, or energy and phase, in mel-scale bands performed best.
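The moving-median peak picking can be sketched like this (`median_win`, `delta`, and `isolation` are illustrative values; the paper tunes such parameters against the ground-truth onset annotations):

```python
import numpy as np

def pick_onsets(odf, median_win=11, delta=0.1, isolation=5):
    """Pick onsets from an onset detection function (ODF) by comparing it
    to a moving-median threshold. All parameter values are hypothetical.
    """
    n = len(odf)
    half = median_win // 2
    onsets = []
    last = -isolation  # enforce a minimum gap between reported onsets
    for i in range(n):
        window = odf[max(0, i - half):min(n, i + half + 1)]
        threshold = np.median(window) + delta  # dynamic, locally adaptive
        is_peak = (
            odf[i] > threshold
            and odf[i] >= odf[max(0, i - 1)]
            and odf[i] >= odf[min(n - 1, i + 1)]
        )
        if is_peak and i - last >= isolation:
            onsets.append(i)
            last = i
    return onsets
```

The median makes the threshold track the local "floor" of the detection function, so quiet and loud passages are treated comparably, which is the point of a dynamic rather than fixed threshold.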
The classification technique recursively splits the data into pairs of Gaussian distributions (I think this may be similar to a GMM, but with a different fitting criterion). This produces a tree whose leaf nodes represent different states. Each frame is then classified by passing its features down the tree and reading off the proportion of each class present at the leaf where it exits.
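The leaf-proportion decision rule can be sketched as below, assuming some routing function `leaf_id` for the trained tree and stored per-leaf class counts (both names are illustrative, not from the paper):

```python
import numpy as np

def classify_file(frames, leaf_id, leaf_class_counts, n_classes):
    """Classify a file from its frames via leaf class proportions.

    `leaf_id` maps a frame's feature vector to the tree leaf it reaches;
    `leaf_class_counts[leaf]` holds per-class training counts at that leaf.
    Each frame votes with its leaf's class proportions, and the file is
    assigned the class with the largest summed probability.
    """
    total = np.zeros(n_classes)
    for frame in frames:
        counts = np.asarray(leaf_class_counts[leaf_id(frame)], dtype=float)
        total += counts / counts.sum()  # normalize to class proportions
    return int(np.argmax(total))
```

Accumulating soft per-frame votes rather than hard labels lets ambiguous frames (leaves with mixed class membership) contribute proportionally instead of forcing a winner per frame.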
150 thirty-second samples of 7 musical genres were used as classes. Combining feature vectors over whole files or non-overlapping 10-second windows performed relatively poorly. Sliding one-second windows performed well, but at the cost of redundant data and consequently longer computation. The onset-based segmentation techniques performed best while remaining computationally cheap.