dictionary_transforms_for_audio

This is a catch-all page for research on the topic of representing an audio signal using a combination of columns in a dictionary matrix.

Hopefully, some transform can be derived which will prove more useful than commonly used transforms such as the DFT. To be effective, it should be:

- motivated by properties of audio signals. In particular, it should “intelligently” deal with the time-frequency trade-off; it would be logical to represent frequency-specific elements in a logarithmic or continuous way.
- higher-level, allowing for features to be derived in a more straightforward way and learning/labeling/classification/separation to be done with higher accuracy. It would be great if the dictionary elements represented very high-level things, like different notes on different instruments or phonemes, but there is probably a trade-off between high-level and closed-form/perfect reconstruction.
- time-localized. Instead of windowing the audio signal and getting a transform of blocks of the signal irrespective of phase (e.g., sparse coding of magnitude DFT spectra), it would be better for the transform to be able to localize events in time - either by coding in the time domain or by considering phase.
- closed-form - that is, there should be a (hopefully simple) set of equations which describe the entries of the dictionary. This is mainly so that it can be used by anyone to achieve the same results… there should be some common specification, instead of a dictionary which is entirely derived by some learning process on some dataset.
- able to achieve (close to) perfect reconstruction. There will probably be some error, especially with an arbitrary learned dictionary. Is there a way to ensure at least that the reconstructed signal sounds “good”, and not low-bandwidth or artifact ridden?
- efficient to compute, or at least guaranteed to yield a solution. The feasibility of finding a sparse solution depends on the characteristics of the dictionary elements… the feasibility of deriving a fast algorithm probably depends on much more complex things (what?)

Note that the DFT is closed-form, achieves perfect reconstruction, and is efficient (via the FFT), but it is not motivated by properties of audio signals, is low-level, and the magnitude DFT is not time-localized.
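Both halves of that observation are easy to check numerically. The sketch below (using numpy's FFT as the DFT implementation) verifies perfect reconstruction and then shows the time-localization failure of the magnitude spectrum: two impulses at different times have identical magnitude spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(256)

# Perfect reconstruction: ifft(fft(y)) recovers y to machine precision.
y_rec = np.fft.ifft(np.fft.fft(y)).real
assert np.allclose(y, y_rec)

# No time localization in the magnitude DFT: an impulse at sample 10 and
# an impulse at sample 200 differ only in phase, so their magnitude
# spectra are identical (both all-ones).
a, b = np.zeros(256), np.zeros(256)
a[10], b[200] = 1.0, 1.0
assert np.allclose(np.abs(np.fft.fft(a)), np.abs(np.fft.fft(b)))
```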

Shift-Invariant Sparse Coding for Audio Classification proposes an efficient algorithm when faced with dictionary elements which are shift-invariant, resulting in a convolution instead of a linear combination.
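The shift-invariant model can be sketched as follows: instead of one scalar activation per atom, each atom gets a whole activation signal, and the reconstruction is a sum of convolutions rather than a linear combination. The atom lengths and activation placements below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
atoms = rng.standard_normal((3, 16))   # 3 atoms, each 16 samples long
acts = np.zeros((3, 256))              # one sparse activation signal per atom
acts[0, 40] = 1.0                      # atom 0 fires at t = 40
acts[2, 100] = -0.5                    # atom 2 fires (scaled) at t = 100

# Shift-invariant reconstruction: y = sum_k atoms[k] * acts[k] (convolution),
# so each atom can appear at any time shift.
y = sum(np.convolve(a, s) for a, s in zip(atoms, acts))
# y contains a copy of atom 0 starting at sample 40 and a scaled copy of
# atom 2 starting at sample 100.
```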

Polyphonic Pitch Tracking By Example creates a huge dictionary of labeled spectrograms and finds a sparse combination of dictionary elements that represents the audio signal using a nearest subspace search.

Efficient Auditory Coding learns dictionaries from different corpora in a “spike” model and finds that, for certain classes of audio signals, the dictionary elements closely resemble gammatone filters.

Sparse Representations in Audio and Music: from Coding to Source Separation gives a nice summary of various places dictionary-based transforms are used in audio and music.

Sparse Representations of Polyphonic Music

On the Use of Sparse Time-Relative Auditory Codes for Music

- Use scikits.learn to get a “first-pass” dictionary from a very large amount of audio data. Try different corpora - speech data, music, environmental sounds, all audio…
- Perform some kind of fitting on the learned dictionary elements to try to get a closed-form version. Compare the accuracy and reconstruction using the purely learned version and the closed-form version. There are probably lots of fitting methods to try, including heuristic ones. For example, maybe some kind of tapered sine/transient dictionary will reveal itself?
- Ideally in the process of parameterizing the dictionary elements, provisions could be made to ensure that a solution can be found quickly, and potentially that perfect reconstruction is possible.
- Would it help if the dictionary elements were in the complex domain? What about convolution? Combining them in a non-linear way?
- Could the dictionary elements be resampled, to remove the redundancy of having an atom for each frequency? This is similar to allowing shifts in the frequency domain.
- Tasks to try - multipitch estimation by labeling dictionary elements, onset detection if certain elements indicate onsets, multi-instrument classification if elements indicate different instruments, compression by fixing and encoding the dictionary and using a sparse number of elements + colored noise, automatic enhancement by fitting sound to a “good sounding” model, etc.
- The distance metric both for encoding and dictionary learning needs to be something perceptually motivated - not a simple L2 norm or correlation of samples/spectra.
- If we compute the distance between some function of the encoded signal and the input, then can we have any kind of guarantees of convergence/recovery? Especially if the function is nonlinear.
- If we can't make any claims about convergence/recovery when doing some nonlinear transformation of the signal, can we learn a tractable transformation which gives good results?
- When we learn/construct the dictionary, to what extent can we make the elements orthonormal and space-spanning and incoherent, and to what degree does it actually matter?
- Can dictionary learning be made more tractable in a “supervised” manner by imposing constraints on the activation vectors?
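The “first-pass” dictionary from the first bullet could be obtained roughly as below. This is a minimal sketch using the modern package name sklearn (scikits.learn was its earlier name); the random matrix stands in for framed audio from one of the corpora, and all sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
# Stand-in for audio data: 200 frames of 64 samples each. In practice
# these would be windowed frames cut from a large corpus.
frames = rng.standard_normal((200, 64))

learner = MiniBatchDictionaryLearning(
    n_components=32,   # number of atoms (illustrative)
    alpha=1.0,         # sparsity penalty on the activations
    random_state=0,
)
codes = learner.fit_transform(frames)  # sparse activations, one row per frame
D = learner.components_                # learned dictionary, 32 atoms x 64 samples
```

The fitting step from the second bullet would then operate on the rows of `D`, e.g. fitting parametric tapered-sinusoid templates to each atom and comparing reconstruction error against the raw learned version.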

Need to fill this in more…

- Using scikits.learn on STFT magnitude spectra could work OK, but the dictionary elements were allowed to be negative, which isn't intuitive for magnitude spectra. The elements were also unstructured unless they covered an entire spectrum (as opposed to just a patch of the spectrogram). This seems to be how most people do it.
- Using scikits.learn on STCQT magnitude spectra gave the dictionary elements more obvious structure with smaller patches, which was good, but it should probably also disallow negative elements, and it didn't seem to give a good reconstruction. Reconstruction is also inherently inexact with the CQT.
- Using scikits.learn in the time domain directly was pretty computationally expensive and had the huge issue of not treating noise well: when training on speech with few dictionary elements, only a couple were noise, and increasing the dictionary size enormously just added many more noise elements; in the reconstruction, the noise components were basically ignored. This is because the distance metric can't cope with noise, so a different metric is needed.
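One standard way to address the negative-element problem on magnitude spectra is to swap the unconstrained dictionary learner for NMF, which constrains both the spectral atoms and the activations to be nonnegative. A minimal sketch (the random matrix stands in for an |STFT|, and all sizes are illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-in for a magnitude spectrogram: 100 frames x 513 frequency bins.
S = np.abs(rng.standard_normal((100, 513)))

model = NMF(n_components=16, init="nndsvda", max_iter=500, random_state=0)
acts = model.fit_transform(S)   # nonnegative activations, 100 x 16
atoms = model.components_       # nonnegative spectral atoms, 16 x 513
```

Because both factors are nonnegative, each atom can be read directly as a spectral template that is only ever added, never subtracted, which matches the intuition for magnitude spectra noted above.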

Need to fill this in with report from 2012.

In traditional sparse coding, we seek to represent a vector $y \in \Re^m$ as $Ax$, where $A \in \Re^{m \times n}$ and $x \in \Re^n$. Here, $A$ can be thought of as a dictionary of $n$ atoms $A_i \in \Re^m$ and each element of $x$ can be thought of as the activation of each of the atoms. $x$ is typically constrained to be sparse via some regularizer $\phi(x)$. A common approach is to cast the problem as a convex optimization over $x$ by setting $\phi(x) = ||x||_1$ and solving $\min_x ||y - Ax||_2^2 + \lambda ||x||_1$ (Lasso regularization). More generally, we can formulate the problem of sparse coding as $\min_x \phi(x) \mathrm{\;s.t.\;} Ax \approx y$, which is not necessarily convex.
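The Lasso formulation above can be solved directly with an off-the-shelf solver. In this sketch, $y$ is built from two known atoms and the recovered $x$ should be sparse with its largest activations on those atoms; note that sklearn's Lasso minimizes $\frac{1}{2m}||y - Ax||_2^2 + \alpha ||x||_1$, so `alpha` plays the role of $\lambda$ up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128))   # dictionary: m = 64, n = 128 atoms
x_true = np.zeros(128)
x_true[[5, 40]] = [2.0, -1.5]        # signal uses exactly two atoms
y = A @ x_true

lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_
# x_hat should be sparse, with the dominant activations on atoms 5 and 40
# (slightly shrunk toward zero by the l1 penalty).
```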

In the domain of signal processing, we often want to describe a signal as the sum of a small number of signals which may give insight into the original signal's content. This problem is well-posed as a sparse coding problem, as we can think of the dictionary $A$ as containing a set of signals which may have informative characteristics. Solving $y = Ax$ and inspecting $x$ can then give insight into characteristics of $y$.

Suppose that prior to performing sparse coding we map our time-domain signal vectors to a vector space $T$ via some transformation denoted by $f(y)$ with inverse $f^{-1}$ such that $f^{-1}(f(y)) = y$. Our dictionary then consists of vectors in $T$, which we denote as $f(A)$. Sparse coding then attempts to find a sparse $x$ such that $f(y) = f(A)x$, or for clarity, $f(y) = \sum_{i = 1}^n f(A)_ix_i$. This is a logical step when the constraint $Ax \approx y$ is too strict in some way, and so instead we wish to perform sparse coding in the space $T$ where $f(A)x \approx f(y)$ “makes sense”.

We would hope that the application of the transformation $f$ still allows us to describe a signal $y$ as the sum of informative signals. However, it is clear that in order to maintain this interpretability we require that $\sum_{i = 1}^n f(A)_ix_i = f\left( \sum_{i = 1}^n A_ix_i \right)$ - in other words, we require that $f$ is additive. More concretely, if we describe $f(y)$ as $f(A)x$ we can only assert that $y$ is the combination of the atoms of $A$ if $f$ is additive. However, it is often the case that an appropriate $f$ is nonlinear and therefore does not satisfy additivity.
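The failure of additivity is easy to observe for a natural choice of $f$, the magnitude spectrum $f(y) = |\mathrm{FFT}(y)|$: the magnitude spectrum of a sum of signals is generically not the sum of their magnitude spectra, because per-bin phases interfere.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(64)
b = rng.standard_normal(64)

# Additivity would require |FFT(a + b)| == |FFT(a)| + |FFT(b)|, which
# only holds if a and b have aligned phase in every bin - the generic
# case is strict inequality by the triangle inequality.
lhs = np.abs(np.fft.fft(a + b))
rhs = np.abs(np.fft.fft(a)) + np.abs(np.fft.fft(b))
assert not np.allclose(lhs, rhs)
```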

An alternative approach would be to instead focus on the problem $\min_x \phi(x) \mathrm{\;s.t.\;} f(Ax) \approx f(y)$. This has the intuitive interpretation of attempting to construct a signal out of elements of $A$ which is close to the signal $y$ in the vector space $T$, which has the desired property where signals which are close are semantically similar. Formulating using the Lasso regularization produces the optimization problem $\min_x ||f(Ax) - f(y)||_2^2 + \lambda ||x||_1$. Unfortunately, if $f$ is nonlinear, this optimization problem may become nonconvex. (should show this)
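Nonconvexity can be demonstrated numerically for the quadratic case $f(z) = z^2$ (elementwise). If $x^*$ satisfies $f(Ax^*) = f(y)$, then so does $-x^*$, but their midpoint is the zero vector, where the data term is large; this violates the midpoint convexity inequality $g\left(\frac{x_1 + x_2}{2}\right) \le \frac{g(x_1) + g(x_2)}{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 8))
x_star = rng.standard_normal(8)
y = A @ x_star                    # by construction, f(A x_star) = f(y)

def g(x):
    # Data term ||f(Ax) - f(y)||_2^2 with f(z) = z**2 elementwise.
    return np.sum(((A @ x) ** 2 - y ** 2) ** 2)

x1, x2 = x_star, -x_star          # two distinct global minimizers, g = 0
mid = 0.5 * (x1 + x2)             # their midpoint is the zero vector
# g(mid) = sum(y**4) > 0 = (g(x1) + g(x2)) / 2, so g is not convex.
```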

(Discuss quadratic $f$ here)

Motivated by this discussion, we focus on two approaches to this problem where $f$ is a quadratic function. First, we disregard convexity and attempt to solve the Lasso regularization with a local search. We also reformulate the problem as a nonconvex optimization via quadratic lifting and propose a convexification technique. We compare these approaches in the context of sparse coding audio signals.
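The first approach can be sketched as follows: proximal gradient descent (ISTA) with soft thresholding as the local search on $\min_x ||f(Ax) - f(y)||_2^2 + \lambda ||x||_1$, with the quadratic $f(z) = z^2$ taken elementwise. The step size, $\lambda$, iteration count, and problem sizes are all illustrative, and since the problem is nonconvex this only finds a local optimum.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def data_term(A, y, x):
    return np.sum(((A @ x) ** 2 - y ** 2) ** 2)

def local_search(A, y, lam=0.1, step=1e-4, iters=5000, seed=1):
    rng = np.random.default_rng(seed)
    # x = 0 is a stationary point of the data term, so initialize randomly.
    x = 0.1 * rng.standard_normal(A.shape[1])
    for _ in range(iters):
        Ax = A @ x
        r = Ax ** 2 - y ** 2            # residual measured after f
        grad = A.T @ (4.0 * r * Ax)     # chain rule through f(z) = z**2
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32))
x_true = np.zeros(32); x_true[7] = 1.0
y = A @ x_true
x_hat = local_search(A, y)
# The local search should at least drive the data term well below its
# value at the uninformative zero vector, data_term(A, y, 0) = sum(y**4).
```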

dictionary_transforms_for_audio.txt · Last modified: 2015/12/17 21:59 (external edit)