|Authors||Guangyu Xia, Dawen Liang, Roger B. Dannenberg, Mark J. Harvilla|
|Publication Info||Retrieved offline|
Proposes a system to help musicians automatically organize performance and rehearsal recordings.
If a musician records all of their performances and rehearsals, finding a specific recording by hand can be time-consuming. To speed this up, recordings should be automatically split into segments (omitting non-music segments), which are then clustered by composition and presented to the user in a user interface.
Past approaches may use low-level features like spectral centroid or zero-crossing rate, but these are not robust to variations in the sound source (noisiness, level). The proposed system uses Eigenmusic features, which are obtained by applying PCA to the audio data: the eigenvectors of a covariance matrix associated with an array of spectrograms of music data are computed, and the first 10 eigenvectors are extracted and used as the feature.
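The PCA step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the random stand-in data, and the choice to project each frame onto the eigenvectors (to get a 10-dimensional per-frame vector) are assumptions.

```python
import numpy as np

def fit_eigenmusic(spectra, n_components=10):
    """Learn a PCA basis from music spectra of shape (n_frames, n_bins)."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # The right singular vectors of the centered data are the eigenvectors
    # of its covariance matrix, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def eigenmusic_features(spectra, mean, basis):
    """Project each frame onto the leading eigenvectors."""
    return (spectra - mean) @ basis.T

# Random numbers stand in for real magnitude spectra (hypothetical data).
rng = np.random.default_rng(0)
train_spectra = rng.random((200, 64))
mean, basis = fit_eigenmusic(train_spectra)
feats = eigenmusic_features(train_spectra, mean, basis)
```

Each row of `feats` is then one 10-dimensional feature vector for one frame.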
AdaBoost builds a sequence of simple hypotheses (weak classifiers) iteratively, weights them, and combines them into a strong classifier. The training and testing data are 5 and 2.5 hours of Western and Chinese music, respectively, stored as 10-dimensional per-frame feature vectors extracted once every 1.25 seconds. 100 weak classifiers were combined into the strong classifier, which brought the error rate down to about 5.5% when the sign of the strong classifier output is taken as the classification label. If the strong classifier output is interpreted as a probability of music vs. non-music, then the probability that a frame is music after a previous frame of music closely matches a logistic function.
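A minimal sketch of discrete AdaBoost follows. The summary does not say which weak learner was used, so the single-feature threshold rules (decision stumps) here are an assumption, as are the synthetic data and the small number of rounds. The score-to-probability mapping is the standard logistic link for AdaBoost scores, not the fit reported in the paper.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Threshold rule on feature j; returns labels in {-1, +1}."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively find the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)  # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost_fit(X, y, n_rounds=100):
    w = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        # Increase the weight of misclassified samples, then renormalize.
        w *= np.exp(-alpha * y * stump_predict(X, j, thr, pol))
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_score(X, ensemble):
    """Real-valued strong classifier output; its sign is the label."""
    return sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)

# Synthetic 10-dimensional frames: +1 = music, -1 = non-music (hypothetical data).
rng = np.random.default_rng(0)
y = np.where(rng.random(100) < 0.5, 1, -1)
X = rng.normal(size=(100, 10)) + 1.5 * y[:, None]
model = adaboost_fit(X, y, n_rounds=20)
scores = adaboost_score(X, model)
accuracy = np.mean(np.sign(scores) == y)
# Logistic mapping from score to a music probability (assumed form).
prob_music = 1.0 / (1.0 + np.exp(-2.0 * scores))
```

Keeping the real-valued score rather than only its sign is what makes the later probabilistic smoothing possible.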
Although the error rate was small, for segmentation a single spurious “non-music” frame could create a new segment, which is not desirable. An HMM was trained to capture the fact that state changes between music and non-music do not occur rapidly and that short-duration music/non-music intervals are less likely. The initial state probabilities and transition probabilities were set manually or by Maximum Likelihood Estimation, and the emission probabilities were set using Bayes' rule. The prior probability of music vs. non-music was set to 0.5, and the Viterbi algorithm was used to find the most likely state sequence for a given observation sequence. Using the HMM lowered the error rate to about 2%, with a fuzzy precision (how many detected boundaries are true ones?) of 89.5% and a fuzzy recall (how many true boundaries are found, allowing small offsets in position?) of 97%.
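The smoothing effect of the HMM can be illustrated with a small Viterbi decoder over two states. This is a sketch under stated assumptions: the per-frame music probabilities and the 0.95 self-transition probability are made up, and with a 0.5 prior the emission likelihoods are taken to be proportional to the classifier posteriors.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path; log_emit has shape (n_frames, n_states)."""
    T, S = log_emit.shape
    delta = np.empty((T, S))           # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_trans  # cand[i, j]: move from i to j
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 1, 0, -1):      # trace the best path backwards
        path[t - 1] = back[t, path[t]]
    return path

# State 0 = non-music, state 1 = music. Frame 4 is a spurious dip.
p_music = np.array([0.9, 0.9, 0.9, 0.3, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1])
log_emit = np.log(np.column_stack([1.0 - p_music, p_music]))
log_init = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.95, 0.05],
                             [0.05, 0.95]]))  # hypothetical "sticky" transitions
path = viterbi(log_init, log_trans, log_emit)
# path is [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]: the isolated dip is smoothed away
```

Thresholding each frame independently would produce three boundaries here, while the decoded path keeps only the real one between frames 6 and 7.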
To summarize the harmonic content of the music segments, chroma vectors are used. To add robustness to minor spectral variations, statistics are taken over 41 consecutive overlapping 200 ms windows with a hop size of 10 vectors and then normalized, giving the Chroma Energy distribution Normalized Statistics (CENS) features. To cluster segments, the distance between the CENS features of a query clip and subsequences of other musical segments is calculated, and if the distance is less than a preset threshold, the segments are clustered together. To generate the database, the longest segment is used to initialize the first cluster, then all other musical segments (sorted by size) are compared to existing clusters, and a new cluster is created when no existing cluster is within the threshold distance. This scheme assumes that the longest musical segment is an entire piece, and that shorter segments are portions of pieces, which should already exist as clusters. To deal with tempo deviation between performances, the CENS features are extracted with varying long-term window and hop sizes, and the smallest distance is used in the clustering.
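The clustering scheme can be sketched as below. Several details are assumptions rather than the paper's method: the CENS quantization step and the tempo variants are omitted, the smoothing uses a Hann window, the distance is one minus the mean inner product of aligned feature vectors, and each cluster is represented by its first (longest) member. All function names are hypothetical.

```python
import numpy as np

def cens_like(chroma, win=41, hop=10):
    """Smooth, downsample, and normalize chroma of shape (n_frames, 12).

    Assumes n_frames >= win; the quantization step of CENS is omitted.
    """
    chroma = chroma / np.maximum(chroma.sum(axis=1, keepdims=True), 1e-12)
    kernel = np.hanning(win)
    smoothed = np.stack(
        [np.convolve(chroma[:, k], kernel, mode="same") for k in range(12)],
        axis=1)
    feats = smoothed[::hop]
    return feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)

def min_subsequence_distance(a, b):
    """Slide the shorter sequence over the longer one; return the smallest distance."""
    query, target = (a, b) if len(a) <= len(b) else (b, a)
    n = len(query)
    return min(
        1.0 - np.mean(np.sum(query * target[i:i + n], axis=1))
        for i in range(len(target) - n + 1))

def cluster_segments(segments, threshold=0.15):
    """Greedy clustering, longest segments first."""
    order = sorted(range(len(segments)), key=lambda i: -len(segments[i]))
    clusters = []  # lists of segment indices; the first entry is the representative
    for i in order:
        best, best_d = None, threshold
        for c in clusters:
            d = min_subsequence_distance(segments[i], segments[c[0]])
            if d < best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([i])  # nothing close enough: start a new cluster
        else:
            best.append(i)
    return clusters

def toy_chroma(pitch_classes, frames_per_note=50):
    """One-hot chroma for a sequence of pitch classes (synthetic test signal)."""
    return np.repeat(np.eye(12)[pitch_classes], frames_per_note, axis=0)

piece_a = toy_chroma([0, 4, 7, 0, 5, 9, 2, 7, 0, 4, 7, 0])
piece_b = toy_chroma([1, 3, 6, 8, 10, 1, 3, 6, 8, 10])
segments = [cens_like(piece_a), cens_like(piece_a[200:500]), cens_like(piece_b)]
clusters = cluster_segments(segments)
excerpt_distance = min_subsequence_distance(segments[1], segments[0])
```

In this toy setup the excerpt of piece A joins the cluster started by the full piece A, while piece B, which shares no pitch classes with A, starts its own cluster.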
Longer segments are more selective but make the tempo deviation issue worse, so a subsegment of each segment can be used instead. The subsegment size and the threshold value should be optimized. F-measure was used as the optimization metric, with <math>\beta = 0.9</math>. It was found that, overall, shorter segments and smaller threshold values give better performance. Shorter segments also save on computation, so a segment size of 40s and a distance threshold of 0.15 were used. This gave an F-measure of close to 1.
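For reference, the weighted F-measure combines precision and recall as (1 + β²)·P·R / (β²·P + R), so a β below 1 weights precision somewhat more than recall. A small sketch follows; the precision and recall values are invented purely to show the effect of β = 0.9 and are not results from the paper.

```python
def f_beta(precision, recall, beta=0.9):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical settings with precision and recall swapped.
high_precision = f_beta(0.9, 0.6)
high_recall = f_beta(0.6, 0.9)
```

With β = 0.9 the high-precision setting scores slightly higher, whereas with β = 1 the two settings would tie.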