Efficient Auditory Coding

Authors Evan C. Smith & Michael S. Lewicki
Publication Info Retrieved here
Retrieval date 9/27/12

Proposes a representation of acoustic waveforms based on a nonlinear spike code model.

Spike Code Formulation

Transforming an audio signal into a representation useful for auditory tasks is an important problem which the human auditory system has evolved to solve. In particular, it may be that our sensory coding attempts to represent an acoustic signal efficiently as a spike code. This kind of encoding can be formulated as <math>x(t) = \sum_{m=1}^{M}\sum_{i=1}^{n_m} s_i^m\phi_m(t - \tau_i^m) + \epsilon(t)</math> where <math>x(t)</math> is the signal to be encoded (with error <math>\epsilon(t)</math>) by a set of kernel functions <math>\phi_1, \ldots, \phi_M</math> of length <math>L_m</math>, with temporal positions <math>\tau_i^m</math> and coefficients <math>s_i^m</math>; <math>n_m</math> is the number of instances of kernel <math>\phi_m</math>. The kernel shape and length depend on the model. This model can encode arbitrary acoustic signals without dividing them into blocks, using elements with precise amplitudes and temporal positions. To encode a signal, the optimal kernel functions <math>\phi_m</math> and the encoding (<math>\tau_i^m</math> and <math>s_i^m</math>) must be found via some algorithm, which involves a trade-off between computational complexity and accuracy. Here, matching pursuit is used to estimate <math>\tau_i^m</math> and <math>s_i^m</math>.
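To make the formulation concrete, the following is a minimal sketch of the decoding direction, assuming time is discretized so that each <math>\tau_i^m</math> is a non-negative sample index. The function name `decode_spikes`, the `(m, tau, s)` tuple layout, and the toy kernels are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def decode_spikes(spikes, kernels, n_samples):
    """Reconstruct x_hat(t) = sum_m sum_i s_i^m * phi_m(t - tau_i^m).

    spikes:  list of (m, tau, s) tuples: kernel index, onset sample, amplitude.
    kernels: list of 1-D arrays; kernel m may have its own length L_m.
    Assumes 0 <= tau < n_samples for every spike.
    """
    x_hat = np.zeros(n_samples)
    for m, tau, s in spikes:
        phi = kernels[m]
        end = min(tau + len(phi), n_samples)  # clip kernels that run past the end
        x_hat[tau:end] += s * phi[: end - tau]
    return x_hat

# Two toy kernels of different lengths (hypothetical, not learned kernels).
kernels = [np.array([1.0, -1.0]), np.array([0.5, 1.0, 0.5])]
spikes = [(0, 1, 2.0), (1, 2, -1.0), (0, 6, 3.0)]
x_hat = decode_spikes(spikes, kernels, n_samples=8)
```

Overlapping spikes simply add, which is why the encoding side cannot choose each spike independently of the others.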


The fundamental structure of audio signals lies in a subspace which is not a linear projection of the original space. The difficulty of inferring the non-linear subspace depends on the sparsity and overlap of acoustic events. In practice, the inter-spike intervals form a gamma distribution with an average mode of 10 ms. Spike coding is non-linear in that it must determine which spike positions are best, taking the locations of the other spikes into account. More linear methods can lead to redundancy in time and across kernels.

Encoding algorithm

Given kernel functions <math>\phi_m</math>, we can project a signal <math>x(t)</math> onto one of the kernels by <math>x(t) = \langle x(t), \phi_m \rangle \phi_m + R_x(t)</math> where <math>R_x(t)</math> is the residual. For unit-norm kernels, the projection with the largest-magnitude inner product minimizes the energy of <math>R_x(t)</math>. In each iteration of matching pursuit, the residual from the previous iteration is projected onto the kernels again and the best-matching kernel and time shift are selected.
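The greedy loop described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes unit-norm kernels shorter than the signal, integer sample shifts, and a fixed spike count `n_spikes` as the stopping rule (a hypothetical parameter chosen here for brevity).

```python
import numpy as np

def matching_pursuit(x, kernels, n_spikes):
    """Greedy matching pursuit over time-shifted, unit-norm kernels.

    Returns the spikes as (m, tau, s) tuples and the final residual.
    """
    residual = np.asarray(x, dtype=float).copy()
    spikes = []
    for _ in range(n_spikes):
        best = None  # (|s|, m, tau, s) of the best match found so far
        for m, phi in enumerate(kernels):
            # Inner product of the residual with phi at every valid shift.
            corr = np.correlate(residual, phi, mode="valid")
            tau = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[tau]) > best[0]:
                best = (abs(corr[tau]), m, tau, corr[tau])
        _, m, tau, s = best
        # Subtract the selected projection; what remains is the new residual.
        residual[tau:tau + len(kernels[m])] -= s * kernels[m]
        spikes.append((m, tau, float(s)))
    return spikes, residual

# Toy check: a signal built from two non-overlapping spikes (hypothetical kernels).
k0 = np.array([1.0, -1.0]) / np.sqrt(2.0)
k1 = np.array([1.0, 2.0, 1.0]) / np.sqrt(6.0)
x = np.zeros(16)
x[2:4] += 3.0 * k0
x[9:12] += -2.0 * k1
spikes, residual = matching_pursuit(x, [k0, k1], n_spikes=2)
```

In this toy case the two spikes do not overlap, so the loop recovers them in order of magnitude and the residual vanishes; with overlapping events the greedy choice is only approximate.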

Kernel Learning

A gradient-based algorithm is used to optimize the kernel functions' shapes and lengths to maximize the fidelity of the representation. The codes should be adapted to the statistics of the signals in question, so it was assumed that the auditory system is adapted to mammalian vocalizations, transient environmental sounds, and ambient environmental sounds (collected for the experiment). The learned kernel functions closely matched revcor filters in the cat auditory system (a closer match than gammatone filters; the learned kernels are asymmetric), provided that all three classes of sounds were used. Kernels trained on speech (TIMIT) alone matched the revcor filters about as well as those trained on the natural sound ensemble.

Learning algorithm

Reformulating the spike coding problem probabilistically gives <math>p(x \mid \phi) = \int p(x \mid \phi, s)p(s) ds \approx p(x \mid \phi, \hat{s})p(\hat{s})</math> where <math>\hat{s}</math> is an approximate posterior maximum found via matching pursuit. Assuming the noise in <math>p(x \mid \phi, \hat{s})</math> is Gaussian and <math>p(\hat{s})</math> is sparse, we can optimize the kernel functions with gradient ascent on the approximate log data probability: <math>\frac{\partial}{\partial\phi_m}\log p(x \mid \phi) = \frac{1}{\sigma_\epsilon^2} \sum_i \hat{s}_i^m [x - \hat{x}]_{\tau_i^m}</math> where <math>[x - \hat{x}]_{\tau_i^m}</math> is the residual error over the extent of <math>\phi_m</math> at time <math>\tau_i^m</math>. Training is done by initializing 32 kernel functions as Gaussian noise with zero-padding; padding was added or removed depending on how much of the kernel was non-zero near the edges. Some kernels were not used in learning or encoding; these were discarded.
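The gradient expression can be sketched numerically as below, assuming discretized time with sample-index onsets. The helper names `kernel_gradients` and `update_kernels`, the step size, and the renormalization of each kernel to unit norm after the update are assumptions made for illustration; the adaptive zero-padding described above is not modelled.

```python
import numpy as np

def kernel_gradients(x, x_hat, spikes, kernels, noise_var=1.0):
    """Gradient of the approximate log-likelihood with respect to each kernel.

    grad_m = (1 / sigma^2) * sum_i s_i^m * [x - x_hat]_{tau_i^m},
    where [.]_tau is the residual windowed to the kernel's extent at onset tau.
    """
    residual = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    grads = [np.zeros(len(phi)) for phi in kernels]
    for m, tau, s in spikes:
        window = residual[tau:tau + len(kernels[m])]
        grads[m][:len(window)] += s * window  # window is shorter if clipped at the end
    return [g / noise_var for g in grads]

def update_kernels(kernels, grads, step=0.01):
    """One gradient-ascent step, then rescale each kernel to unit norm (assumed)."""
    updated = []
    for phi, g in zip(kernels, grads):
        new_phi = phi + step * g
        updated.append(new_phi / np.linalg.norm(new_phi))
    return updated

# Toy example: one kernel that captures only the first half of a two-sample event.
kernels = [np.array([1.0, 0.0])]
x = np.array([0.0, 1.0, 1.0, 0.0])
spikes = [(0, 1, 1.0)]                    # (kernel index, onset sample, amplitude)
x_hat = np.array([0.0, 1.0, 0.0, 0.0])    # reconstruction from that single spike
grads = kernel_gradients(x, x_hat, spikes, kernels)
new_kernels = update_kernels(kernels, grads, step=0.5)
```

Each spike pushes its kernel toward the part of the signal the current code failed to explain, weighted by the spike's amplitude, so the kernel in the toy example grows a second sample.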


Comparing the signal-to-noise ratio of the speech-trained and gammatone spike code models to that of the DWT and the DFT at various bitrates revealed that the kernels had better fidelity at low bitrates, but worse fidelity at higher bitrates, where the DFT and DWT were better able to model the residual, which was largely Gaussian noise. The similarity of the kernels to revcor filters shows that the revcor filters are ideally matched to the statistical structure of natural sounds. It also suggests that models of the cochlea should not have a fixed bandwidth across the filterbank. Using reverse correlation to derive the equivalent revcor filters from the theoretical model matches the learned kernel functions almost exactly; both are uncorrelated. That the kernels allow for efficient representation is crucial. Not all sound information is biologically relevant, so a biologically inspired model will not guarantee an accurate description.

efficient_auditory_coding.txt · Last modified: 2015/12/17 21:59 (external edit)