User Tools

Site Tools


identifying_repeated_patterns_in_music_using_sparse_convolutive_non-negative_matrix_factorization

Identifying Repeated Patterns in Music Using Sparse Convolutive Non-Negative Matrix Factorization

Authors Ron J. Weiss and Juan Pablo Bello
Publication Info Retrieved here
Retrieval date 2/23/11

Proposes a method for music segmentation by factoring a beat-synchronized matrix of chroma vectors. A summary of the presentation given for the paper at ISMIR10 is here.

Background

Most music involves repetition, but it's hard to extract from the audio itself. Previous techniques use self-similarity matrices and HMMs. The proposed method uses shift-invariant PLCA, and treats music as a concatenation of a subset of short, repeated patterns, whose length is not necessarily known in advance.

NMF, PLCA, and SI-PLCA

NMF decomposes a matrix <math>V</math> into <math>WH</math>, where (when <math>V</math> is a time-frequency representation of the audio signal, like beat chromagrams in the paper) <math>W</math> has frequency templates in each column and <math>H</math> has the “amount” of each template at each time. PLCA treats <math>W</math> and <math>H</math> as multinomial probability distributions and adds a diagonal mixing weight matrix <math>Z</math>. For shift invariance, PLCA is modified so that W is a tensor, and is combined via convolution: <math>V\approx\sum_k W_k \star z_k h_k^T</math>

Enforcing sparse matrices

It is favorable to use sparse matrices in NMF, which can be encouraged by prior distributions. To learn the number of patterns, the Dirichlet (multinomial conjugate) distribution is used, with a parameter <math>\alpha_z \lt 1</math> which favors sparse solutions. If <math>z</math> is sparse, then initializing with many bases will prune out those which do not which do not contribute significantly (removing the need to specify the exact composition rank). The pattern length <math>L</math> should be variable per-basis. A Dirichlet prior is also used with a parameter <math>\alpha_w</math> that depends on the time position <math>\tau</math> within each basis - this parameter basically penalizes pattern lengths greater than some value. The sparsity of <math>h_k^T</math> should also be enforced to capture more information about <math>V</math> in <math>W_k</math>. The result is that certain bases will have information which repeats often, and others will have “special case” information.

Expectation/Maximization for Parameter Estimation

The EM algorithm is used. Expectation: calculate the posterior distribution over variables <math>k</math> and <math>\tau</math> for each cell in <math>V</math>, represented as a set of matrices for each possible combination where each point represents the probability that the corresponding point in <math>V</math> was generated by basis <math>k</math> at time <math>\tau</math>. Maximization: Update the decomposition according to parameters.

Segmentation

<math>H</math> encodes long-term structure; <math>W</math> encodes short-term structure. Per-beat chromagrams are calculated, and sequences of chromagrams are assumed to happen only once, and one chromagram sequence is chosen per-segment via a rectangular smoothing window. This segmentation is not necessarily verse/chorus/etc, but instead is “groups of a similar chord progression”. The amount that this is true is parameterized by <math>\alpha_{w\tau}</math>.

Evaluation

Tested on a set of Beatles songs, with performance measured using pairwise precision, recall, and F-measure and entropy-based over- and under-segmentation scores. The number of patterns is optimized by varying <math>K</math> with <math>\al_z=1</math> and varying <math>\al_z</math> with <math>K=15</math>. Optimal results were when <math>\al_z = 0.98</math> and <math>K=4</math>. <math>L</math> was optimized between 10 and 120 beats, and was found to peak at <math>L=20</math>. Allowing for varying pattern length has negligible effect on segmentation performance (even if the results are quantitatively better). The proposed system (with fixed rank and with sparse <math>z</math>) to an HMM and a self-similarity matrix based system, and got comparable results. The self-similarity system had better boundary detection, mostly thanks to the smoothing used in the proposed system.

identifying_repeated_patterns_in_music_using_sparse_convolutive_non-negative_matrix_factorization.txt · Last modified: 2015/12/17 21:59 (external edit)