Hierarchical Spike Coding of Sound

Authors | Yan Karklin, Chaitanya Ekanadham, Eero Simoncelli |

Publication Info | Retrieved offline |

Retrieval date | 11/6/12 |

Proposes a two-layer probabilistic generative model of acoustic structure: the first layer codes sound as a sparse set of time-shifted kernels, and the second layer codes patterns in the positions of those kernels.

Sounds can be described by events with precise timing and frequency, but understanding a sound requires a higher-level representation that is insensitive to signal variability. Previous approaches typically decompose the signal's spectrogram into a sparse combination of patches, but operating on spectrograms sacrifices time and frequency resolution. Instead, the paper proposes a representation that is shiftable in time and frequency: it first generates a “spikegram” and then modulates the probabilities of the spikegram's coefficients.

A spikegram represents a signal <math>x_t</math> as a linear combination of time-shifted (by <math>\tau</math>) kernels <math>\phi_f(t)</math>, weighted by coefficients <math>S_{\tau, f}</math>, plus a residual <math>\epsilon_t</math>: <math>x_t = \sum_{\tau, f} S_{\tau, f}\phi_f(t - \tau) + \epsilon_t</math>. Here, the <math>\phi_f</math> are chosen to be gammatones. Spikes (nonzero coefficients at particular times and frequencies) are estimated using an approximate inference method.
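The synthesis equation above can be illustrated with a minimal NumPy sketch. The gammatone parameters (4th order, ERB-based bandwidth, 50 ms duration), the sample rate, and the names `gammatone` and `synthesize` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def gammatone(fc, fs, duration=0.05, order=4):
    """Unit-norm gammatone kernel centered at fc Hz.

    Order, duration and the ERB-based bandwidth are illustrative assumptions.
    """
    t = np.arange(int(duration * fs)) / fs
    bandwidth = 1.019 * (24.7 + 0.108 * fc)  # Glasberg-Moore ERB scaling
    g = t ** (order - 1) * np.exp(-2 * np.pi * bandwidth * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)

def synthesize(spikes, kernels, n_samples):
    """x_t = sum of S * phi_f(t - tau) over spikes given as (tau, f, amplitude)."""
    x = np.zeros(n_samples)
    for tau, f, amp in spikes:
        k = kernels[f]
        end = min(tau + len(k), n_samples)  # clip kernels that run past the end
        x[tau:end] += amp * k[: end - tau]
    return x

fs = 16000
kernels = [gammatone(fc, fs) for fc in (500.0, 1000.0, 2000.0)]
spikes = [(100, 0, 1.0), (400, 2, 0.5)]  # (sample offset, kernel index, amplitude)
x = synthesize(spikes, kernels, n_samples=2000)
```

The residual <math>\epsilon_t</math> is omitted here; in the model it is whatever part of the signal the chosen spikes do not explain.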

The spikegram exhibits structure at both fine and coarse temporal scales, which may be caused by large-scale acoustic events, correlations among the dictionary kernels, and temporal structure. To capture this, a second layer of unobserved spikes <math>S^{(2)}</math> is added. These spikes are assumed to have a Poisson process prior and are convolved with a set of time-frequency “rate” kernels <math>K^r</math> to modulate the logarithm of the firing rate of the first-layer spikes on a coarse scale. A separate set of “coupling” kernels <math>K^c</math> is convolved with the local spike history to further modulate the firing rate of the first-layer spikes. Finally, the mean of the log-amplitudes of the first-layer spikes is modulated by a convolution of the second-layer spikes with “amplitude” kernels <math>K^a</math>, with no recurrent contribution. The full model is formulated in the paper.
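As a rough illustration of the feed-forward part of this description, the sketch below builds log-rate and log-amplitude fields from a sparse set of second-layer spikes and then samples first-layer spikes. It is not the paper's formulation: the coupling term <math>K^c</math> is left out for brevity, and the array layout (time by frequency), the per-bin Poisson spike probability, the Gaussian log-amplitude noise `sigma`, and all function names are assumptions.

```python
import numpy as np

def stamp(field, kernel, t, f, amp=1.0):
    """Add amp * kernel to field with its corner at (t, f), clipped at the edges.

    Summing stamps over spikes equals convolving the sparse spike array with the kernel.
    """
    T, F = field.shape
    kt, kf = kernel.shape
    t1, f1 = min(t + kt, T), min(f + kf, F)
    field[t:t1, f:f1] += amp * kernel[: t1 - t, : f1 - f]

def modulation_fields(s2, K_r, K_a, b_r, b_a, T):
    """Log-rate and mean log-amplitude fields from second-layer spikes (k, t, f, amp)."""
    log_rate = np.tile(b_r, (T, 1)).astype(float)  # per-channel log-rate bias
    log_amp = np.tile(b_a, (T, 1)).astype(float)   # per-channel log-amplitude bias
    for k, t, f, amp in s2:
        stamp(log_rate, K_r[k], t, f, amp)
        stamp(log_amp, K_a[k], t, f, amp)
    return log_rate, log_amp

def sample_first_layer(log_rate, log_amp, sigma, rng):
    """Sample first-layer spikes; coupling (spike-history) effects are ignored here."""
    p = 1.0 - np.exp(-np.exp(log_rate))  # P(at least one event) in a Poisson bin
    occurs = rng.random(log_rate.shape) < p
    amps = np.exp(log_amp + sigma * rng.standard_normal(log_rate.shape))
    return np.where(occurs, amps, 0.0)

rng = np.random.default_rng(0)
T, F = 50, 8
b_r, b_a = np.full(F, -20.0), np.zeros(F)
K_r = [np.full((2, 2), 25.0)]  # toy rate kernel
K_a = [np.full((2, 2), 1.0)]   # toy amplitude kernel
s2 = [(0, 10, 2, 1.0)]         # one second-layer spike of kernel 0 at t=10, f=2
log_rate, log_amp = modulation_fields(s2, K_r, K_a, b_r, b_a, T)
S1 = sample_first_layer(log_rate, log_amp, sigma=0.1, rng=rng)
```

Adding the coupling term would require sampling sequentially in time, since each first-layer spike changes the rate of the spikes that follow it.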

The model parameters (rate, coupling, and amplitude kernels, plus log-rate and log-amplitude bias vectors) can be estimated by maximizing the joint log-probability of the first and second layers. This is done with a coordinate descent approach that alternates between updating the model parameters and inferring the second-layer spikes. The parameter update can use a gradient-based method. Inferring the second-layer spikes so as to maximize the joint log-probability is NP-hard, so a matching pursuit algorithm applied to a Taylor approximation of the joint log-probability is used instead.
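One way to picture a greedy inference step is sketched below. It assumes a Poisson-style likelihood in which the log-rate is linear in the second-layer spikes, so the first-order change in log-likelihood from adding a unit spike of kernel k at a given position is the correlation of that kernel with the residual (observed spikes minus predicted rate). The order of the Taylor expansion, the treatment of amplitudes, and the fixed `cost` standing in for the Poisson prior are all assumptions; the paper's actual procedure may differ.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def first_order_gain(spikes1, rate, kernel):
    """Correlate a rate kernel with (spike indicator - rate) at every valid shift."""
    residual = (spikes1 != 0).astype(float) - rate
    windows = sliding_window_view(residual, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def greedy_step(spikes1, rate, kernels, cost):
    """Return (gain, k, t, f) for the best candidate spike, or None if below cost."""
    best = None
    for k, kernel in enumerate(kernels):
        gain = first_order_gain(spikes1, rate, kernel)
        t, f = np.unravel_index(np.argmax(gain), gain.shape)
        if best is None or gain[t, f] > best[0]:
            best = (gain[t, f], k, int(t), int(f))
    return best if best[0] > cost else None

# Toy example: first-layer spikes form a 3x2 block that matches the kernel.
spikes1 = np.zeros((20, 6))
spikes1[5:8, 1:3] = 1.0
rate = np.full((20, 6), 0.01)
kernel = np.ones((3, 2))
result = greedy_step(spikes1, rate, [kernel], cost=1.0)
```

After a candidate is accepted, the rate inside the kernel's footprint would be multiplied by `np.exp(kernel)` before the next step, and the loop would stop once no candidate exceeds the cost.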

The TIMIT speech corpus was encoded with a dictionary of 200 gammatones to 20 dB, and a model with 20 rate and amplitude kernels, plus coupling kernels for each frequency channel, was trained. The rate and amplitude kernels resembled harmonic stacks, frequency ramps, temporal onsets and offsets, and other acoustic features. The coupling kernels indicated that when a spike occurs, neither it nor its neighboring spikes tend to recur soon afterward, and that spikes tend to align across time and frequency. The features combine to approximate, for example, different vowels, even with the same kernels across different pitches and speakers. A spikegram can be synthesized by sampling the generative model, and the results can be intelligible, especially when both the hierarchical and coupling kernels are used.

Spikegrams can be used for denoising if matching pursuit is terminated once the gain in log-likelihood from adding a spike falls below the cost of adding it. If the fixed spike probability is replaced with the rate specified by the Hierarchical Spike Coding (HSC) model, and the layers of the HSC are inferred by MAP estimation, the signal reconstructed from the first layer is a denoised version of the input. Both approaches were compared with Wiener filtering and wavelet-based thresholding on noisy speech. The HSC method performed best, though only slightly better than the standard matching pursuit spikegram, for both white noise and sparse, temporally modulated noise.
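The stopping rule for the plain matching pursuit case can be sketched as follows. The sketch assumes unit-norm kernels and Gaussian noise of standard deviation `sigma`, so that adding a spike with coefficient c raises the log-likelihood by c²/(2σ²); the constant `spike_cost` stands in for the negative log of a fixed spike probability. These choices and the function name are illustrative assumptions, and signed coefficients are allowed here for simplicity.

```python
import numpy as np

def matching_pursuit(x, kernels, sigma, spike_cost, max_spikes=1000):
    """Greedy spike coding that stops when a spike's likelihood gain is below its cost.

    Kernels are assumed unit-norm, so the best coefficient equals the correlation.
    """
    residual = np.asarray(x, dtype=float).copy()
    spikes = []
    for _ in range(max_spikes):
        best = None
        for f, k in enumerate(kernels):
            corr = np.correlate(residual, k, mode="valid")  # corr[tau] = <residual[tau:], k>
            tau = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[tau]) > abs(best[2]):
                best = (tau, f, corr[tau])
        tau, f, c = best
        if c * c / (2 * sigma ** 2) < spike_cost:
            break  # explaining more of the residual is not worth another spike
        residual[tau:tau + len(kernels[f])] -= c * kernels[f]
        spikes.append((tau, f, c))
    return spikes, residual

# Toy example with two orthogonal unit-norm kernels.
k0 = np.array([1.0, 1.0, 1.0, 1.0]) / 2
k1 = np.array([1.0, -1.0, 1.0, -1.0]) / 2
x = np.zeros(50)
x[10:14] += 3 * k0
x[30:34] += 2 * k1
spikes, residual = matching_pursuit(x, [k0, k1], sigma=1.0, spike_cost=1.0)
denoised = x - residual  # reconstruction from the accepted spikes only
```

For the HSC variant described above, the constant `spike_cost` would be replaced by a cost that depends on time and frequency through the model's rate, which is not shown here.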
