On the Use of Sparse Time-Relative Codes for Music

Authors | Pierre-Antoine Manzagol, Thierry Bertin-Mahieux and Douglas Eck |

Publication Info | Retrieved here |

Retrieval date | 10/23/12 |

Evaluates the effectiveness of sparse and time-relative representations for music signals.

Many auditory analysis techniques start by taking the DFT, which trades off time precision against frequency precision and is sensitive to how the signal is split into blocks. Sparse coding represents a signal as the sum of basis functions taken from an overcomplete dictionary. In the frequency domain, this involves deriving a codebook of power spectra and assuming the spectrum of a signal is a weighted sum of dictionary elements. In the time domain, kernels are matched to a signal using a MAP estimate, a correlation threshold, or matching pursuit (which offers the best compromise between computational complexity and SNR). Time-domain kernels can be specified explicitly (e.g. gammatones) or learned with a gradient-based method.

A signal <math>x</math> is modeled as the linear superposition of <math>M</math> kernels <math>\phi_m</math> at time locations <math>\tau_i^m</math> with scaling coefficients <math>s_i^m</math> and error <math>\epsilon</math>: <math>x(t) = \sum_{m=1}^M\sum_{i=1}^{n_m}s_i^m\phi_m(t-\tau_i^m)+\epsilon(t)</math>. Finding the optimal encoding in this formulation is NP-hard, so a greedy matching pursuit algorithm is used instead: it cross-correlates the signal with the kernels, subtracts the best-fitting projection, then repeats the process on the residual.
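The greedy loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `matching_pursuit`, the fixed spike budget `n_spikes` as a stopping rule, and the assumption that every kernel is a unit-norm list of floats are all choices made here for the example.

```python
def matching_pursuit(signal, kernels, n_spikes):
    """Greedy sparse coding sketch: returns (spikes, residual).

    Each spike is a tuple (kernel_index, time_offset, coefficient).
    Assumes every kernel has unit norm and is no longer than the signal.
    """
    residual = list(signal)
    spikes = []
    for _ in range(n_spikes):
        best = None  # (abs projection, kernel index, offset, projection)
        for m, kernel in enumerate(kernels):
            for tau in range(len(residual) - len(kernel) + 1):
                # Cross-correlation of the residual with kernel m at shift tau.
                proj = sum(k * residual[tau + j] for j, k in enumerate(kernel))
                if best is None or abs(proj) > best[0]:
                    best = (abs(proj), m, tau, proj)
        if best is None or best[0] == 0.0:
            break  # nothing left to explain
        _, m, tau, s = best
        # Subtract the chosen projection, then iterate on what is left.
        for j, k in enumerate(kernels[m]):
            residual[tau + j] -= s * k
        spikes.append((m, tau, s))
    return spikes, residual
```

The exhaustive search over all shifts is quadratic and only suitable for toy signals; a practical encoder would compute the cross-correlations with an FFT and update them incrementally.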

Gammatone kernels are windowed sinusoids with a sharp rise and slow decay, a shape that fits many real phenomena. Fitting gammatones to an instrument note makes the onset very apparent because of the increase in the number of kernels used to estimate the signal. The model also tries to be as efficient as possible, using only as many kernels as are “needed” at a given point in time, where more kernels are needed for more complex signals. In the context of music, many more kernels were needed for high-frequency content in metal music. The number of kernels used has only a small impact on the signal-to-noise ratio.
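To make the "sharp rise, slow decay" shape concrete, here is a sketch of a single gammatone kernel using the common form <math>t^{n-1}e^{-2\pi bt}\cos(2\pi ft)</math>. The filter order of 4, the 50 ms duration, and the ERB-based bandwidth (Glasberg and Moore's approximation) are assumptions made for this example; the notes above do not state which parameters were used.

```python
import math

def gammatone(center_hz, sample_rate, duration_s=0.05, order=4):
    """Unit-norm gammatone kernel as a list of floats.

    Bandwidth follows the ERB approximation 24.7 * (4.37 * f / 1000 + 1),
    scaled by 1.019; these constants are assumptions of this sketch.
    """
    erb = 24.7 * (4.37 * center_hz / 1000.0 + 1.0)
    b = 1.019 * erb
    n = int(round(duration_s * sample_rate))
    kernel = []
    for i in range(n):
        t = i / sample_rate
        # Gamma-shaped envelope: polynomial rise, exponential decay.
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        kernel.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    norm = math.sqrt(sum(v * v for v in kernel))
    return [v / norm for v in kernel]
```

For example, `gammatone(1000.0, 16000)` gives an 800-sample kernel whose envelope peaks within the first few milliseconds and then decays over the rest of the window.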

AdaBoost was used on a dataset of music recordings labeled by genre. Because classifiers take a fixed number of inputs and the number of spikes used to represent a signal varies, a fixed-length feature needs to be derived from the spike code. A 257-dimensional feature vector was calculated for each segment by counting the number of times each kernel is used, summing each kernel's scaling coefficients (normalized and unnormalized), and counting the total number of spikes in the segment. This gave genre classification results comparable to MFCCs, and better results when combined with MFCCs.
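One way such a fixed-length feature could be built from a variable-length spike code is sketched below. The exact layout of the 257 dimensions is not spelled out above, so this is an assumed layout: four per-kernel statistics (count, normalized count, summed coefficient magnitude, normalized sum) plus the total spike count, which would give 257 dimensions with 64 kernels. The function name `spike_features`, the spike tuple format `(kernel_index, time_offset, coefficient)`, and the use of absolute values are also assumptions.

```python
def spike_features(spikes, n_kernels):
    """Fixed-length summary of the spikes falling in one segment.

    Returns a list of length 4 * n_kernels + 1 (assumed layout).
    """
    counts = [0.0] * n_kernels
    coeff_sums = [0.0] * n_kernels
    for m, _tau, s in spikes:
        counts[m] += 1.0
        coeff_sums[m] += abs(s)
    total = float(len(spikes))
    coeff_total = sum(coeff_sums)
    # Normalized variants sum to 1 over the kernels (or stay 0 when empty).
    norm_counts = [c / total if total else 0.0 for c in counts]
    norm_sums = [c / coeff_total if coeff_total else 0.0 for c in coeff_sums]
    return counts + norm_counts + coeff_sums + norm_sums + [total]
```

The vector length depends only on the dictionary size, not on how many spikes the encoder emitted, which is what lets it feed a standard classifier.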

The hope is that kernels learned from music will better encode music signals, provide better input for MIR tasks, and form a meaningful representation of the music signal. Learning the kernels <math>\phi_m</math> for a signal <math>x</math> is done by gradient ascent on every dimension of each kernel, using the residual signal <math>x-\hat{x}</math> at each spike location. With <math>\Phi</math> denoting the full kernel set, the gradient is <math>\frac{\partial}{\partial\phi_m}\log p(x \mid \Phi) = \frac{1}{\sigma_\epsilon} \sum_i \hat{s}_i^m[x-\hat{x}]_{\tau_i^m}</math>. 32 normalized kernels were learned on a large database and did not resemble the gammatones very closely.
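A single learning step following this gradient might look like the sketch below: for every spike, the residual window at that spike's location is weighted by the spike's coefficient and accumulated into the gradient of the kernel that produced it. The function name `update_kernels`, the step size `step`, the spike tuple format `(kernel_index, time_offset, coefficient)`, and renormalizing each kernel to unit norm after the step are assumptions of this example.

```python
import math

def update_kernels(kernels, spikes, residual, step=0.01, sigma_eps=1.0):
    """One gradient-ascent step on the kernels, then renormalization.

    `residual` is the signal minus its reconstruction from `spikes`.
    """
    grads = [[0.0] * len(k) for k in kernels]
    for m, tau, s in spikes:
        for j in range(len(kernels[m])):
            # Coefficient-weighted residual window aligned with the spike.
            grads[m][j] += s * residual[tau + j] / sigma_eps
    updated = []
    for kernel, grad in zip(kernels, grads):
        moved = [k + step * g for k, g in zip(kernel, grad)]
        norm = math.sqrt(sum(v * v for v in moved))
        updated.append([v / norm for v in moved] if norm else moved)
    return updated
```

In a full training loop this step would alternate with re-encoding the signal, since the spikes and the residual change whenever the kernels do.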

Gammatones achieved a better SNR than the learned kernels when coding songs from codebooks of the same size. The learned kernels performed about as well as the gammatones for genre discrimination.
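For reference, the SNR of an encoding can be computed from the original signal and the residual left after reconstruction. This sketch assumes the usual definition in decibels (signal energy over residual energy); the notes do not say exactly how the SNR was measured, and the name `snr_db` is chosen here.

```python
import math

def snr_db(signal, residual):
    """Signal-to-noise ratio in dB: 10 * log10(signal energy / residual energy)."""
    signal_energy = sum(v * v for v in signal)
    noise_energy = sum(v * v for v in residual)
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / noise_energy)
```

Comparing two codebooks of the same size then amounts to encoding the same songs with each and comparing the resulting `snr_db` values.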
