|Context||Columbia Neural Network Reading Group/Seminar Series|
How can we remove different parts of the speech recognition pipeline and replace them with neural networks? The speech recognition problem can be formulated as finding the most likely sequence of hypothesized words given an audio waveform. This can be broken up into an acoustic model (relating the waveform to words) and a language model (probabilities of word sequences). Typically, a pipeline first extracts features from the audio and maps them to individual states, then converts those states to phonemes, then to words and sentences. The acoustic model portion includes feature extraction and the mapping from features to phoneme states. This talk focuses on simplifying the acoustic model. The most common features used for neural networks are log-mel filterbank features, which are computed over windowed segments of the audio waveform. Ideally, we could learn this feature transformation instead, in the hope of finding a better feature representation. Once features are computed, we need to convert them to states; GMMs were traditionally used, but neural networks now give state-of-the-art results. The acoustic model maps acoustic features to phone states, but the number of states is huge when you consider triphones (phonemes with left and right context), so the set of states is typically compressed. Ideally, we could also replace this portion of the model with a learning-based approach.
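As a rough illustration of the log-mel filterbank features described above, here is a minimal NumPy sketch. The specific settings (16 kHz audio, 25 ms Hann windows, 10 ms hop, 512-point FFT, 40 mel bands) and all function names are assumptions chosen for illustration; the talk only states that the features are computed over windowed segments of the waveform.

```python
import numpy as np

def hz_to_mel(hz):
    # A common mel-scale formula; other variants exist.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel(waveform, sample_rate=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    """Windowed power spectrum -> mel filterbank energies -> log."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # Small offset (an assumption) avoids log(0) on silent frames.
    return np.log(mel_energy + 1e-10)

# Usage: one second of a 440 Hz tone gives a (frames, 40) feature matrix.
t = np.arange(16000) / 16000.0
features = log_mel(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)
```

Every step here is fixed by hand (window, FFT, triangular filters, log), which is exactly the part the talk proposes to learn instead.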
State-of-the-art results have been achieved using CNNs, DNNs, and LSTMs, but each is individually limited in its modeling capabilities. A limitation of using LSTMs alone is that you are relying on them to disentangle the underlying factors of variation within the input; handing that job to an earlier layer could allow the LSTMs to focus on temporal structure instead. Convolutional networks are a good candidate, since they are particularly good at disentangling factors from raw input. Similarly, DNNs can be used to transform the LSTM hidden states to the phoneme state space. The proposed model (CLDNN), then, uses CNNs to disentangle factors in the spectrogram and perform speaker adaptation, LSTMs for temporal modeling, and DNNs for mapping the LSTM state to a space where better discrimination is possible. In this model, the input is a 40-dimensional log-mel filterbank, which is convolved with a set of frequency-only CNN filters, then fed into a few LSTM layers, then into a series of dense layers. On a small test task, adding a single CNN layer before an LSTM layer helps, but stacking multiple CNN layers doesn't. Similarly, adding more than two dense layers after the LSTM layers doesn't help. Combining the CNN input layer and the DNN output layers with the LSTM performs best. For a larger-scale task, adding convolutional and dense layers helps more than adding more LSTM layers. Adding temporal context to the input didn't help; the LSTM is good enough at modeling the temporal variations. CLDNNs show about a 5% relative improvement for large-scale speech search tasks.
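The layer ordering described above (frequency-only convolution, then LSTM, then dense layers) can be sketched as a plain NumPy forward pass. This is only a shape-level illustration under stated assumptions: it uses one layer of each type, random untrained weights, and hypothetical sizes (8 filters of width 8, pooling of 3, 16 LSTM units, 32 dense units, 10 output states). None of these numbers come from the talk, apart from the 40-dimensional input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def freq_conv(frames, filters, pool=3):
    """Convolve along frequency only, apply ReLU, then max-pool over frequency.

    frames: (T, F) log-mel frames; filters: (K, W). Returns (T, n_pool * K).
    """
    T, F = frames.shape
    K, W = filters.shape
    n_pos = F - W + 1
    windows = np.stack([frames[:, i:i + W] for i in range(n_pos)], axis=1)  # (T, n_pos, W)
    act = np.maximum(0.0, windows @ filters.T)                              # (T, n_pos, K)
    n_pool = n_pos // pool
    pooled = act[:, :n_pool * pool].reshape(T, n_pool, pool, K).max(axis=2)
    return pooled.reshape(T, n_pool * K)

def lstm(inputs, W, U, b):
    """Single unidirectional LSTM layer. inputs: (T, D). Returns (T, H)."""
    H = U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x in inputs:
        gates = x @ W + h @ U + b
        i, f, o = (sigmoid(gates[k * H:(k + 1) * H]) for k in range(3))
        g = np.tanh(gates[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def init_params(rng, n_mel=40, n_filters=8, width=8, pool=3,
                hidden=16, dense=32, n_states=10):
    """Random weights with hypothetical layer sizes (illustration only)."""
    conv_dim = ((n_mel - width + 1) // pool) * n_filters
    s = 0.1
    return {
        "conv": s * rng.standard_normal((n_filters, width)),
        "pool": pool,
        "lstm": (s * rng.standard_normal((conv_dim, 4 * hidden)),
                 s * rng.standard_normal((hidden, 4 * hidden)),
                 np.zeros(4 * hidden)),
        "dense_w": s * rng.standard_normal((hidden, dense)),
        "dense_b": np.zeros(dense),
        "out_w": s * rng.standard_normal((dense, n_states)),
        "out_b": np.zeros(n_states),
    }

def cldnn_forward(frames, params):
    """CNN -> LSTM -> dense -> softmax over states, one layer of each."""
    x = freq_conv(frames, params["conv"], params["pool"])
    x = lstm(x, *params["lstm"])
    x = np.maximum(0.0, x @ params["dense_w"] + params["dense_b"])
    logits = x @ params["out_w"] + params["out_b"]
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
posteriors = cldnn_forward(rng.standard_normal((20, 40)), init_params(rng))
print(posteriors.shape)  # one distribution over states per frame
```

The convolution never mixes frames, so all temporal modeling is left to the LSTM, which matches the division of labor the talk describes.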
It would be nice to learn a feature representation at the front end, rather than using the log-mel filterbank. So far, no research has shown improvements from using raw waveforms over log-mel filterbanks. In this work, a gammatone-style processing approach is used: instead of just a filter and a rectifier (as in a standard CNN layer), an averaging step and a compression step are also applied. The output dimensionality of the layer is then the number of filters used; this representation can be treated as a time-frequency representation and used in place of the log-mel filterbank. In a large-scale voice search experiment, initializing the filter weights to gammatone filters (i.e., something meaningful) results in a lower error rate than initializing to random filters or using fixed gammatone filters as-is. For clean speech, this system improves on log-mel by a small amount; for noisy speech, the margin is bigger. The learned feature representation does resemble a log-mel-style time-frequency representation. If the peak frequencies of the filters are plotted, the resulting curves resemble those of log-mel or gammatone filterbanks, whether or not the gammatone initialization was used. The main difference is that more filters are placed in the low-frequency range of the signal, which may be because it carries more information about the vowels. This is especially true of the filters trained on clean speech. To test which part of the new system was helping, systems were built without the frequency convolution layer and with differing numbers of LSTM layers; in general, these learned time-domain filter systems were unable to beat log-mel filterbanks (and were sometimes worse).
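The filter, rectify, average, compress sequence described above can be sketched in NumPy as follows. This is an illustration with fixed gammatone filters, not the trained layer from the talk: in the described system the filter weights are learned (optionally starting from a gammatone initialization). The gammatone impulse response with ERB-based bandwidths is a standard textbook form; the window and hop lengths, the filter length, the compression offset `eps`, and all function names are assumptions.

```python
import numpy as np

def gammatone_filters(center_freqs, sample_rate=16000, length=400, order=4):
    """Gammatone impulse responses, one row per center frequency.

    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*f*t), with b tied to the
    equivalent rectangular bandwidth (ERB) of each center frequency.
    """
    f = np.asarray(center_freqs, dtype=float)[:, None]
    t = (np.arange(length) / sample_rate)[None, :]
    erb = 24.7 * (4.37 * f / 1000.0 + 1.0)
    b = 1.019 * erb
    filt = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
    return filt / np.linalg.norm(filt, axis=1, keepdims=True)

def time_domain_features(waveform, filters, window=560, hop=160, eps=1e-3):
    """Raw waveform -> (n_frames, n_filters) time-frequency representation."""
    n_filters = filters.shape[0]
    n_frames = 1 + (len(waveform) - window) // hop
    feats = np.empty((n_frames, n_filters))
    for i in range(n_frames):
        segment = waveform[i * hop : i * hop + window]
        for k in range(n_filters):
            response = np.convolve(segment, filters[k], mode="valid")  # filter
            rectified = np.maximum(0.0, response)                      # rectify
            averaged = rectified.mean()                                # average over the window
            feats[i, k] = np.log(averaged + eps)                       # compress
    return feats

# Usage: a 500 Hz tone should excite the 500 Hz filter the most.
t = np.arange(16000) / 16000.0
filters = gammatone_filters([250.0, 500.0, 1000.0, 2000.0, 4000.0])
feats = time_domain_features(np.sin(2 * np.pi * 500.0 * t), filters)
print(feats.shape, feats.mean(axis=0).argmax())
```

Each frame yields one value per filter, so the output has the same layout as a log-mel frame and can be dropped into the same downstream network.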
When training a neural network acoustic model, an existing model is usually used to align the training speech waveforms to the word labels in order to get per-frame labels. This was typically done either with Viterbi alignment, which assigns a hard label to each frame, or with Baum-Welch, which assigns a soft label. Connectionist Temporal Classification (CTC) circumvents this by learning the alignment jointly while training the acoustic model. When training with cross-entropy, a label must be predicted at each frame; CTC allows a “blank” label so that only some frames have a predicted label. After the labels and blanks are output, repeated labels without separating blanks are merged and the blanks are removed. Cross-entropy tries to maximize the probability of the correct class at every single frame given the frame-level alignment; in CTC, a “lattice” $z^l$ is defined which encodes all possible alignments of the input $x$ with the label sequence $l$. The probability $p(z^l | x)$ is then computed with the forward-backward algorithm and the error is backpropagated. This results in models which usually produce sharper transitions. The resulting approach can improve word error rates over using context-dependent states, particularly when using a bidirectional model (only LSTM models were tested, not CLDNN models). Future work will combine raw-waveform CLDNNs with CTC.
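To make the CTC description concrete, here is a small sketch of the two pieces mentioned above: the collapsing rule (merge repeats, then drop blanks) and the forward recursion that sums over all alignments of $x$ with $l$. It works with plain probabilities for readability, whereas practical implementations work in the log domain for numerical stability. Only the forward half of forward-backward is shown, and the function names and the choice of index 0 for the blank are assumptions.

```python
import itertools
import numpy as np

BLANK = 0  # assumed index of the blank label

def ctc_collapse(path, blank=BLANK):
    """Merge repeated labels not separated by a blank, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

def ctc_forward_prob(probs, labels, blank=BLANK):
    """Sum the probability of every frame-level path that collapses to `labels`.

    probs: (T, V) per-frame label posteriors; labels: non-blank label ids.
    """
    T = probs.shape[0]
    ext = [blank]                      # label sequence with blanks interleaved
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]       # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]       # skip the blank between distinct labels
            alpha[t, s] = total * probs[t, ext[s]]
    result = alpha[T - 1, S - 1]
    if S > 1:
        result += alpha[T - 1, S - 2]
    return result

def ctc_brute_force(probs, labels, blank=BLANK):
    """Reference check: enumerate every path (only feasible for tiny T and V)."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if ctc_collapse(path, blank) == list(labels):
            total += np.prod([probs[t, s] for t, s in enumerate(path)])
    return total

print(ctc_collapse([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]

rng = np.random.default_rng(0)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)
print(np.isclose(ctc_forward_prob(probs, [1, 2]), ctc_brute_force(probs, [1, 2])))
```

The brute-force function exists only to check the recursion on tiny inputs; the dynamic program is what makes the sum over alignments tractable during training.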