identifying_repeated_patterns_in_music_using_sparse_convolutive_non-negative_matrix_factorization

Authors | Ron J. Weiss and Juan Pablo Bello |

Publication Info | Retrieved here |

Retrieval date | 2/23/11 |

Proposes a method for music segmentation by factoring a beat-synchronized matrix of chroma vectors. A summary of the presentation given for the paper at ISMIR10 is here.

Most music involves repetition, but it's hard to extract from the audio itself. Previous techniques use self-similarity matrices and HMMs. The proposed method uses shift-invariant PLCA, and treats music as a concatenation of a subset of short, repeated patterns, whose length is not necessarily known in advance.

NMF decomposes a matrix <math>V</math> into <math>WH</math>, where (when <math>V</math> is a time-frequency representation of the audio signal, like beat chromagrams in the paper) <math>W</math> has frequency templates in each column and <math>H</math> has the “amount” of each template at each time. PLCA treats <math>W</math> and <math>H</math> as multinomial probability distributions and adds a diagonal mixing weight matrix <math>Z</math>. For shift invariance, PLCA is modified so that W is a tensor, and is combined via convolution: <math>V\approx\sum_k W_k \star z_k h_k^T</math>

It is favorable to use sparse matrices in NMF, which can be encouraged by prior distributions. To learn the number of patterns, the Dirichlet (multinomial conjugate) distribution is used, with a parameter <math>\alpha_z \lt 1</math> which favors sparse solutions. If <math>z</math> is sparse, then initializing with many bases will prune out those which do not which do not contribute significantly (removing the need to specify the exact composition rank). The pattern length <math>L</math> should be variable per-basis. A Dirichlet prior is also used with a parameter <math>\alpha_w</math> that depends on the time position <math>\tau</math> within each basis - this parameter basically penalizes pattern lengths greater than some value. The sparsity of <math>h_k^T</math> should also be enforced to capture more information about <math>V</math> in <math>W_k</math>. The result is that certain bases will have information which repeats often, and others will have “special case” information.

The EM algorithm is used. Expectation: calculate the posterior distribution over variables <math>k</math> and <math>\tau</math> for each cell in <math>V</math>, represented as a set of matrices for each possible combination where each point represents the probability that the corresponding point in <math>V</math> was generated by basis <math>k</math> at time <math>\tau</math>. Maximization: Update the decomposition according to parameters.

<math>H</math> encodes long-term structure; <math>W</math> encodes short-term structure. Per-beat chromagrams are calculated, and sequences of chromagrams are assumed to happen only once, and one chromagram sequence is chosen per-segment via a rectangular smoothing window. This segmentation is not necessarily verse/chorus/etc, but instead is “groups of a similar chord progression”. The amount that this is true is parameterized by <math>\alpha_{w\tau}</math>.

Tested on a set of Beatles songs, with performance measured using pairwise precision, recall, and F-measure and entropy-based over- and under-segmentation scores. The number of patterns is optimized by varying <math>K</math> with <math>\al_z=1</math> and varying <math>\al_z</math> with <math>K=15</math>. Optimal results were when <math>\al_z = 0.98</math> and <math>K=4</math>. <math>L</math> was optimized between 10 and 120 beats, and was found to peak at <math>L=20</math>. Allowing for varying pattern length has negligible effect on segmentation performance (even if the results are quantitatively better). The proposed system (with fixed rank and with sparse <math>z</math>) to an HMM and a self-similarity matrix based system, and got comparable results. The self-similarity system had better boundary detection, mostly thanks to the smoothing used in the proposed system.

identifying_repeated_patterns_in_music_using_sparse_convolutive_non-negative_matrix_factorization.txt · Last modified: 2015/12/17 21:59 (external edit)