|Context||Columbia EESIP Seminars|
Natural sounds have structure: speech, for example, has spectral content that relates to its phonemes and syllables. It would be useful to enable a computer to learn this structure. Spectrograms may not be the best representation for learning it, because of redundancy, the time/frequency resolution trade-off, and artifacts caused by blocking. Instead, it may be better to code in the time domain with a “spikegram”, which places short kernels (e.g. gammatones) at specific times with some weight. Convolving the spikes with the dictionary gives the reconstruction; inferring the spikegram from the signal is harder. This representation can be more compact. The spikes in a spikegram are highly correlated: observing a certain spike at a certain time changes the likelihood of observing another spike nearby. For example, it's unlikely that you will observe one spike immediately after another.
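As a rough illustration of the reconstruction step, the sketch below sums weighted, time-shifted gammatone kernels. Everything specific in it is an assumption made for the example and not taken from the talk: the function names `gammatone` and `reconstruct`, the sample rate, the kernel length, the bandwidth, and the three center frequencies.

```python
import numpy as np

def gammatone(center_hz, sr=16000, duration=0.025, order=4, bandwidth_hz=100.0):
    """Unit-norm gammatone: t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*f*t)."""
    t = np.arange(int(sr * duration)) / sr
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * bandwidth_hz * t)
         * np.cos(2 * np.pi * center_hz * t))
    return g / np.linalg.norm(g)

def reconstruct(spikes, dictionary, n_samples):
    """Rebuild a signal from a spikegram.

    spikes: list of (kernel_index, sample_offset, weight) tuples.
    Each spike adds one weighted copy of its kernel starting at its offset,
    which is the same as convolving a sparse spike train with the kernel.
    """
    signal = np.zeros(n_samples)
    for k, offset, weight in spikes:
        kernel = dictionary[k]
        end = min(offset + len(kernel), n_samples)  # clip at the signal edge
        signal[offset:end] += weight * kernel[: end - offset]
    return signal

# Toy dictionary and spikegram (illustrative values only).
dictionary = [gammatone(f) for f in (200.0, 800.0, 3200.0)]
spikes = [(0, 100, 1.0), (2, 400, -0.5), (1, 900, 0.8)]
signal = reconstruct(spikes, dictionary, n_samples=2000)
```

Three tuples describe the whole 2000-sample signal here, which is the sense in which the representation can be compact.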
To model this structure, we can construct a second-level spikegram representation in which spikes are convolved with larger kernels that modulate the firing rate of the first-layer spikes. This essentially indicates the density of the first-layer spikes. The second-layer kernels should have shapes that match the higher-level structure of the first-layer spikegram. The second-layer kernels are additive, and are shiftable in time and log-frequency. The second-layer spikegram encodes the first-layer spikegram on a coarse scale; finer-scale relationships (which are probably due to the encoding) are captured by recurrent coupling kernels. The second-layer spikes are convolved with both rate kernels and amplitude kernels. The rate, amplitude, and coupling kernels must be learned, but the second-layer spikes are not known either, which makes this a non-convex problem, so an iterative procedure similar to matching pursuit is used.
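One way to picture the rate part of the second layer is as a sparse set of kernel "stamps" added into a channel-by-time map. The sketch below is only a guess at the general shape of such a model: the exponential link from the additive map to a positive firing rate, the baseline value, the toy kernel shapes, and the name `rate_map` are all assumptions for illustration, not details given in the talk.

```python
import numpy as np

def rate_map(second_layer_spikes, rate_kernels, n_channels, n_frames,
             baseline_log_rate=-4.0):
    """Firing-rate map for first-layer spikes, driven by second-layer spikes.

    second_layer_spikes: list of (kernel_index, channel, frame, weight).
    rate_kernels: list of 2-D arrays (log-frequency channels x time frames).
    Kernels are added into a log-rate map at their (channel, frame) shift,
    then exponentiated so the rate stays positive (an assumed link function).
    """
    log_rate = np.full((n_channels, n_frames), baseline_log_rate)
    for m, ch, frame, weight in second_layer_spikes:
        kernel = rate_kernels[m]
        kc, kt = kernel.shape
        c_end = min(ch + kc, n_channels)   # clip at the map edges
        t_end = min(frame + kt, n_frames)
        log_rate[ch:c_end, frame:t_end] += weight * kernel[: c_end - ch, : t_end - frame]
    return np.exp(log_rate)

# Two hypothetical kernels: a broadband onset and an upward sweep.
onset = np.ones((8, 2))   # many channels, very short in time
sweep = np.eye(6)         # channel index rises with time
rates = rate_map([(0, 4, 10, 2.0), (1, 20, 30, 3.0)], [onset, sweep],
                 n_channels=32, n_frames=50)
```

Because the kernels are shifted in both time and log-frequency, the same kernel can describe, say, an onset at any time or a sweep starting at any channel.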
The learned coupling kernels show the same structure, in which a kernel is unlikely to re-occur on a short time scale, and indicate that certain other gammatones are likely to occur shortly after others. The amplitude and rate kernels indicate higher-level structure, like harmonic stacks, onsets and frequency sweeps. Even when the structure appears different, the frequency and time profiles tend to be different. They also tend to be bimodal due to different gender (male/female), so there is some consistency in the second-level spikes between genders. Resynthesizing an original spikegram from the second-layer model using only the amplitude and rate kernels retains the coarse structure but not the fine-level correlations. Including the coupling kernels helps impose the fine-scale correlation but does not get much closer to the original spikegram. Across different speakers, the second-level spikegram tends to use similar kernels.
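The effect of a coupling kernel can be sketched as a history-dependent adjustment to a spike's rate. The form below (a log-rate that sums lagged contributions from earlier spikes) is an assumed, GLM-style parameterization chosen for illustration; the talk's exact formulation is not given in these notes, and `coupled_rate`, the lags, and all numeric values are hypothetical.

```python
import numpy as np

def coupled_rate(history, coupling, base_rate, target, t):
    """Rate of kernel `target` at time t given earlier first-layer spikes.

    history: list of (kernel_index, time) for spikes that already occurred.
    coupling[i, j, lag - 1]: change in the log-rate of kernel j, `lag` steps
    after a spike of kernel i (negative = suppression, positive = excitation).
    """
    max_lag = coupling.shape[2]
    log_rate = np.log(base_rate[target])
    for source, s in history:
        lag = t - s
        if 1 <= lag <= max_lag:  # only recent spikes have any influence
            log_rate += coupling[source, target, lag - 1]
    return float(np.exp(log_rate))

n_kernels, max_lag = 2, 5
coupling = np.zeros((n_kernels, n_kernels, max_lag))
coupling[0, 0, :] = np.linspace(-3.0, -0.5, max_lag)  # kernel 0 suppresses itself
coupling[0, 1, :] = np.linspace(1.0, 0.2, max_lag)    # kernel 0 promotes kernel 1
base_rate = np.array([0.05, 0.05])
history = [(0, 10)]  # a single spike of kernel 0 at time 10

just_after_same = coupled_rate(history, coupling, base_rate, target=0, t=11)
just_after_other = coupled_rate(history, coupling, base_rate, target=1, t=11)
long_after = coupled_rate(history, coupling, base_rate, target=0, t=30)
```

With these toy values, the rate of kernel 0 drops right after its own spike, the rate of kernel 1 rises, and both return to the base rate once the lag exceeds the kernel length.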
Both spikegrams and hierarchical spikegrams can be used to denoise an audio signal. With matching pursuit, the “more important” frequency components tend to be modeled first, so the noise can be thought of as being suppressed in the reconstruction. With the second-layer spikes, there is an imposed structure which helps inform which parts are noise (non-speech). Hierarchical spike coding tends to do just a tiny bit better (SNR-wise) than the standard model when removing constant white Gaussian noise from speech, although the quality of the reconstruction is quite different. If the noise is temporally modulated by the speech's amplitude envelope, all methods do worse, but hierarchical spike coding still does a tiny bit better.
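The first-layer denoising idea can be sketched with a plain greedy matching pursuit that stops after a few spikes, so that low-energy noise is left in the residual. This is a minimal sketch under stated assumptions, not the method evaluated in the talk: the synthetic three-spike "clean" signal, the noise level, the fixed spike budget, and the helper names `matching_pursuit` and `snr_db` are all made up for the example, and the hierarchical model is not implemented.

```python
import numpy as np

def gammatone(center_hz, sr=16000, n=400, order=4, bandwidth_hz=100.0):
    """Unit-norm gammatone kernel of n samples."""
    t = np.arange(n) / sr
    g = (t ** (order - 1) * np.exp(-2 * np.pi * bandwidth_hz * t)
         * np.cos(2 * np.pi * center_hz * t))
    return g / np.linalg.norm(g)

def matching_pursuit(signal, dictionary, n_spikes):
    """Greedily pick the (kernel, offset) with the largest |correlation|."""
    residual = signal.astype(float).copy()
    spikes, energies = [], [float(residual @ residual)]
    for _ in range(n_spikes):
        best = None
        for k, kernel in enumerate(dictionary):
            corr = np.correlate(residual, kernel, mode="valid")
            offset = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[offset]) > abs(best[2]):
                best = (k, offset, float(corr[offset]))
        k, offset, weight = best
        # Kernels are unit-norm, so the correlation is the projection weight.
        residual[offset:offset + len(dictionary[k])] -= weight * dictionary[k]
        spikes.append(best)
        energies.append(float(residual @ residual))
    return spikes, residual, energies

def snr_db(clean, estimate):
    """Signal-to-noise ratio of an estimate against the clean reference."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - estimate) ** 2))

rng = np.random.default_rng(0)
dictionary = [gammatone(f) for f in (200.0, 800.0, 3200.0)]
clean = np.zeros(2000)
for k, offset, weight in [(0, 100, 1.0), (1, 700, -0.8), (2, 1300, 0.9)]:
    clean[offset:offset + 400] += weight * dictionary[k]
noisy = clean + 0.05 * rng.standard_normal(clean.size)  # constant white noise

spikes, residual, energies = matching_pursuit(noisy, dictionary, n_spikes=3)
denoised = noisy - residual  # reconstruction from the selected spikes only
print(f"input SNR:  {snr_db(clean, noisy):.1f} dB")
print(f"output SNR: {snr_db(clean, denoised):.1f} dB")
```

The spike budget acts as the denoising knob here: too few spikes drop real signal, while too many start fitting the noise. Real speech would need a much larger dictionary and a principled stopping rule.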