User Tools

Site Tools


Sparse Representations in Audio and Music: from Coding to Source Separation

Authors Mark D. Plumbley, Thomas Blumensath, Laurent Daudet, Rémi Gribonval, Michael Davies
Publication Info Retrieved here
Retrieval date 11/20/12

Provides an overview of applications of sparse representations in many audio and music processing tasks.


Many audio signals are generated by resonant systems and physical impacts, which are respectively sparse in frequency and time, so we might hope to represent them with a sparse combination of elements from some basis or dictionary. Given $T$ samples of an audio signal represented as a vector $\bar{x} = [x(1), \ldots, x(T)]$, this can be written as $\bar{x} = \sum_{q = 1}^Q u_q \bar{\phi}_q$ where $\bar{\phi}_q$ are the basis vectors, or equivalently $\bar{x} = \bar{u}\Phi$ where $\Phi_{qt} = \phi_q(t)$. A typical basis $\Phi$ is the DFT basis, where $\phi_q(t) = \frac{1}{T} \mathrm{exp}\left( \frac{2\pi j}{T} qt\right)$ which is invertible so we can set $\bar{u} = \bar{x}\Phi^{-1} = \bar{x}T\Phi^{\mathrm{H}}$. DCT and DWT bases are similarly invertible. The DFT and MDSCT are fairly sparse for steady-state audio signals, but are nonsparse for transient signals.

If the number of basis vectors exceed the number of signal samples ($Q > T$), the system will be underdetermined, but we can seek a sparsest activation vector $\bar{u}$ instead. Sparse solutions can be useful because they can represent a signal in a high-level way for separation, denoising, etc. Finding the sparsest $u$ can be formulated as $\min ||\bar{u}||_0$ such that $\bar{x} = \bar{u}\Phi$, but this is an NP-hard, non-convex problem. We can relax to solving $\min ||u||_1$ such that $\bar{x} = \bar{u}\Phi$. Alternatively, a greedy algorithm like matching pursuit or orthogonal matching pursuit can be used which attempts to code the signal iteratively until some approximation criteria is met.

Audio Coding/Compression

For good compression, we might hope to be able to decompose audio signals in a domain where energy is concentrated on a few terms, but the diversity of audio signals makes this difficult for fixed orthonormal bases. One option is to adapt the orthonormal basis according to the signal's local characteristics. The DFT is guaranteed to not have nice regularities in both time and frequency, so typically lapped orthogonal transforms are used which exploit aliasing cancellation properties of the cosine transform under certain conditions, yielding the MDCT where $\phi_{k, p}(t) = h(t - pL) \sqrt{\frac{2}{L}} \cos \left( \frac{\pi}{L} \right(t - pL + \frac{1 + L}{2} \right)\left(k + \frac{1}{2}\right) \right)$ where $L$ is the frame size and $h$ is a window. The MDCT is not shift-invariant and its time and frequency resolution is suboptimal. However, the choice of the window allows it to be adapt when a transient signal is detected.

Another option is to use an overcomplete basis with the intention of only using coefficients deemed significant in the coding. There is a trade-off between increasing dictionary size and thus increasing sparsity and complexity. A “union of bases” can be used, which is composed of (for example) an MDCT basis (for steady-state signals) and a wavelet basis (for transient parts) or multiple MDCT basis at different time-frequency resolutions, such that (for the MDCT-wavelet case) $\bar{x} = \bar{u}_C\Phi_C + \bar{u}_W\Phi_W$. An 8-MDCT basis with power-of-two windows, with coding done by Matching Pursuit, was shown to have comparable quality to MPEG-2 AAC. The 8MDCT domain also proved to be helpful for common MIR tasks. A “holy grail” coding would be one where the coding is semantically meaningful and the resynthesis is perceptually good.


Ideally, the noisy (non-desirable) part of a signal is poorly represented in the sparse coding, so discarding small coefficients and reconstructing the signal can serve as a form of denoising. Under a probabilistic framework, priors can be included which help decide which parts of the signal are desirable.

Source Separation

In some cases we are presented with an audio signal which is actually the mixture of other audio signals, and we would like to recover the individual mixture components. Under the instantaneous model, time delays and reverberation are ignored and the signal is measured $I$ times as $x_i(t) = \sum_{j = 1}^J a_{ij}s_j(t) + e_i(t)$ (in matrix form $X = AS + E$) where $a_{ij}$ is the mixture weight of source $s_j(t)$ with noise $e_i(t)$. With small noise and a known square, invertible mixing matrix $A$, we can recover $\hat{s}(t) = A^{-1}x(t)$. The pseudoinverse can be used for noninvertible $A$, or independent component analysis can be used when $A$ is not known.

If there are fewer observations than sources, the system is underdetermined and sparse coding can help separate the sources or approximate $A$. Hopefully if we sparse code the mixture the mixture coefficients correspond to one of the original sources. If the original sources have sparse representations in a basis matrix $\Phi$, then $S = Z\Phi$ and in noiseless instantaneous mixing we have $U = X\Phi^{-1} = AZ$. If $Z$ is sparse such that at most one source coefficient is active (each column of $Z$ contains at most one nonzero entry) then each vector of $U$ is a scaled version of one of the mixing matrix $A$'s columns. We can estimate the the single nonzero source coefficient by finding the mixing matrix column which is most correlated with each column of $U$ and forming a mask which is $1$ for this index and $0$ otherwise, known as binary masking. When at each transform index at most one source has a non negligible coefficient, this is a good estimation. This assumption is violated in the time domain with temporally overlapping sources, but in the MDCT basis the separation is better. If at most one source coefficient is active at each transform index, then the mixing matrix can be estimated by the correlation of each of the measurements of the signal. If only one measurement is available, additional statistical information must be exploited to identify the active sources.

In the convolutive case, $A$ is a matrix of mixing filters applied to each source to give $X \approx A\star S$ (which is multiplicative in the frequency domain). This allows for convolutional reverb and delays from the source to the measurement. If only a few components are active at some time-frequency index, then the measurements can be approximated by only considering those components in the mixing matrix. The source coefficients for these components can be estimated by using the pseudoinverse of $A$ (with only the active sources) multiplied against the measurements. Estimating the set of active sources is a sparse approximation problem, and can be solved in a binary (L0 norm) or continuous (L1 norm) setting.


If the notes being played in a piece of music are considered as sparse coding coefficients, they represent a transcription of the piece of music. We would like to, for example, describe the spectrogram of a recording of music $X$ using a matrix of vectors of note activities $S$. The dictionary $A$ used to sparse code $X$ as $X \approx AS$ can be learned from the data in order to try to obtain activation vectors $S$ which correspond to note activations. This relies on the assumption that each note corresponds to a single scaled spectrum which does not hold for real-world recordings. Coding can also be done in the time domain using a shift-invariant sparse coding of audio waveforms.

sparse_representations_in_audio_and_music_from_coding_to_source_separation.txt · Last modified: 2015/12/17 14:59 (external edit)