shift-invariant_sparse_coding_for_audio_classification

Authors | Roger Grosse, Rajat Raina, Helen Kwong and Andrew Y. Ng |

Publication Info | Retrieved here |

Retrieval date | 9/26/12 |

Proposes a sparse coding technique for unsupervised classification of audio.

Obtaining labeled data or labeling data for supervised or semi-supervised learning is expensive, and approaches like transfer learning (which uses additional knowledge from a separate task) also requires labeled data. Self-taught learning attempts to use easily unlabeled data to improve classification, even data from entirely different classes than the problem at hand. By developing algorithms that can leverage this large amount of additional data, learning performance may be possible.

Sparse coding of a signal attempts to learn a large dictionary of patterns (basis functions) such that the signal can be represented approximately by a sparse (small number) of these dictionary elements. In this way, audio signal vectors can be described by a few numbers which correspond to a weighted sum of distinct acoustic events and represent a higher-level representation of the audio signal. If a large amount of unlabeled data is used to generate a dictionary, the weights, positions, and shifts necessary to recreate an input signal can be used for supervised learning and classification.

Sparse coding can be expressed as a model where <math>m</math> unlabeled input signals <math>x_i</math> are assumed to be generated by a dictionary of <math>n</math> basis functions <math>a_j</math> according to linear combination components <math>s_{i, j}</math> drawn from a Laplacian (or other heavy-tailed) prior distribution with some Gaussian noise <math>\epsilon_i</math> added. Given unlabeled inputs, sparse coding involves solving the optimization problem to estimate <math>a_j</math> and <math>s_{i, j}</math> : <math>\min_{a, s} \sum_{i = 1}^{m}\|x_i - \sum_{j = 1}^n a_js_{i, j} \|_2^2 + \Be\sum_{i, j}|s_{i, j}| \mid \|a_j\|_2^2 \le c, 1 \le j \le n</math>. The constraint on <math>\|a_j\|_2^2</math> prevents solutions where <math>s_{i, j}</math> is very small and <math>a_j</math> is very large. The problem is convex in either <math>s</math> or <math>a</math> and can be solved with two separate optimizations keeping one variable fixed and optimizing the other.

Traditional sparse coding would have to learn every possible time offset of each acoustic pattern, so instead a shift-invariant sparse coding must be done which allows each basis function appear at each time offset in the signal, which can be reformulated as <math>x_i = \sum_j a_j\star s_{i, j} + \epsilon_i</math>. The bases <math>a_j</math> are allowed to be lower dimension than the input vectors to avoid redundancy, and the linear combination coefficients <math>s_{i, j, t}</math> are now vectors representing the coefficient for every possible temporal offset <math>t</math> of the basis <math>a_j</math>. This changes the optimization problem to <math>\min{a, s} \sum_{i=1}^m \|x_i - \sum_{j=1}^n a_j\star s_{i, j} \|_2^2 + \Beta\sum_{i, j}\|s_{i, j}\|_1 \mid \|a_j\|_2^2 \le c, 1 \le j \le n</math>. Expanding out the convolution quickly becomes computationally intractable, so an efficient algorithm (computing the coefficients and separately the bases) must be developed.

Fixing the bases <math>a</math>, the coefficients for one input can be found independently of other inputs, so choosing a single input we must solve <math>\min_s \|x - \sum{j = 1}^n a_j\star s_j\|_2^2 + \Beta \sum_{j=1}^n \|s_j\|_1</math>. The series of <math>s_{j}</math> over time instants <math>t</math> are highly coupled, making it difficult to heuristically select a small subset of variables to optimize. Using the “feature-sign” trick, the optimization problem can be reduced to an unconstrained quadratic optimization problem over the nonzero coefficients if the sign of the coefficients can be guessed correctly, which can then be solved with standard numerical methods. If few coefficients are nonzero over the course of the optimization problem, the quadratic optimization can be done quickly, but for long signals the number of nonzero coefficients grows linearly with the length of the signal which makes the algorithm too expensive. Solving the problem with an iterative sliding window of size <math>2q</math> and keeping the other coefficients fixed guarantees the optimal solution and is computationally feasible - usually only taking two passes.

Optimizing the choice of bases in the shift-variant sparse coding problem can be done by writing the quadratic objective as the sum of <math>p</math> quadratic terms, transforming it into a dual optimization problem. In SISC, the different basis functions are coupled in the objective, but the optimization problem can be made much easier by transforming to the frequency domain (<math>a \rightarrow A</math>). Parseval's theorem (<math>\|A\|_2^2 = K\|a\|_2^2</math>, <math>K</math> constant) suggests that the optimization problem will not be changed by transforming to the frequency domain, and the convolution theorem changes the optimization problem to <math>\min_{A} \sum_i \|X_i - \sum_j A_j S_{i, j} \|_2^2</math> subject to <math>\|A_j\|_2^2 \le cK</math>. For this problem, the Lagrangian can be decomposed as a sum of quadratic terms, each depending on a single frequency component and can be expressed as a function of only real variables by using the real and imaginary components of <math>A</math> separately. The Fourier transforms of every coefficient vector cannot be precomputed as storage would be an issue, but computing each component's Fourier transform is not efficient. Instead, the coefficient and coefficient-input vector matrices are precomputed, and only half of the frequency spectra are computed as the input is real.

Given a labeled training set <math>\{x_{l, 1}, y_{y, 1}, \ldots, x_{l, k}, y_{l, k}\}</math> where <math>y_{l, i}</math> are the labels and unlabeled examples <math>\{x_1, \ldots, x_m \}</math>, apply sparse coding to the unlabeled data to learn the basis function <math>a_j</math>, and then use the bases to construct features for each input by computing <math>f(x_l) = \operatorname{arg}\min_s \|x_l - \sum_j a_j s_j \|_2^2 + \Beta \sum_j|s_j|</math>. These features produce a higher-level representation which can be used with classification algorithms but are computationally infeasible and conceptually unsatisfactory for inputs with spatial or temporal extent. Instead, SISC should be used to learn shift-invariant basis functions <math>a_j</math> by solving <math>\~x_l = \operatorname{arg}\min_s \|x_l - \sum_ja_j\stars_j\|_2^2 + \Beta \sum_j \|s_j \|_1</math> where <math>x_{l, t, j}</math> is the coefficient found for basis <math>a_j</math> at time <math>t</math>. These features will be robust in to time shifts - the features will also shift, but without a change in values. Gaussian Discriminant Analysis posits that the conditional distribution of th einput given the class label is Gaussian: <math>P(\~{x}_l\mid y) = \prod_t P(\~{x}_{l, t} \mid y)</math> where <math>P(\~{x}_{l, t} \mid y)</math> is modeled as a multivariate Gaussian <math>\[\~{x}_{l, t, 1}, \ldots, \~{x}_{l, t, n} \]^T \sim \mathcal{N}(\mu_y, \Sigma_y)</math> (the parameters <math>\mu_y, \Sigma_y</math> depend on <math>y</math>. The SISC features are poorly modeled by a multivariate Gaussian generative distribution, so an additional two-step generative model for the SISC features is used where the sign is determined, and for non-zero sign its value is determined according to an exponential distribution.

Previous methods learned the coefficients using a heuristic to predict a subset of coefficients to optimize. The FS-EXACT algorithm converges to a nearly optimal solution much faster than the heuristic approaches. Using FS-EXACT with Newton's method on the dual problem (Fourier domain) also converges faster than other methods for solving for the coefficients. Two tasks were carried out to test classification accuracy: a speaker identification task using a small amount of labeled data and a large amount of unlabeled data from different speakers in different dialect regions and a musical genre identification task using a small number of labeled 6-second song snippets and a large amount of unlabeled data comprising songs of different genres of music than the labeled snippets. The SISC features were compared against MFCC and raw spectral features for the speaker identification task and MFCC, raw spectral, and a hand-engineered set of features for genre classification. Classification was done with each feature set using SVM (input vectors averaged over the entire signal), GDA and their MultiExp algorithm (input vectors averaged over windows). Classifier parameters were chosen using cross-validation on a separate development set. The SISCs using SVMs outperformed other features in every case in the genre identification task. Speaker identification was done with noise added to the system - either the same noise for train/test, random for all examples, or differently for train and test. For the random and different cases, SISCs performed best.

shift-invariant_sparse_coding_for_audio_classification.txt · Last modified: 2015/12/17 21:59 (external edit)