|Context||Columbia Neural Network Reading Group/Seminar Series|
Until recently, the dominant paradigm for acoustic modeling was the Gaussian mixture model. These were subsequently displaced by deep neural networks, and now recurrent networks, specifically long short-term memory (LSTM) networks, are being used to achieve state-of-the-art results.
Recurrent neural networks are an extension of feed-forward neural networks: rather than simply feeding the activations forward, the hidden state is fed back as an input at the following time step. The output of the network therefore reflects not only the current input but also the previous state of the network across time. This is in contrast to feeding a fixed time window into a feed-forward network, which explicitly limits the temporal context of the output. Recurrent networks allow for all kinds of sequence labeling tasks (labeling an entire sequence, predicting future time steps, transducing/processing signals, etc.). The standard RNN has three weight matrices - one to project the input to the hidden state, one to project the hidden state back to the hidden state, and one to project the hidden state to the output. There are also biases and nonlinear activations for computing both the new hidden state and the output.
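As a concrete sketch, one step of the standard RNN described above might look like the following (illustrative NumPy code with made-up dimensions; the choice of tanh for both nonlinearities is arbitrary):

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: the three weight matrices project
    input->hidden, hidden->hidden, and hidden->output."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)  # new hidden state
    y = np.tanh(W_hy @ h + b_y)                  # output at this time step
    return h, y

# Run over a short sequence, carrying the hidden state forward.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 16, 4
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

h = np.zeros(n_hid)
outputs = []
for t in range(5):
    x_t = rng.normal(size=n_in)
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
    outputs.append(y)
```

Because `h` is threaded through every call, the output at step 5 depends on all five inputs, which is exactly the unbounded temporal context a fixed window lacks.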
To train a recurrent neural network, a straightforward method is to "unroll" the network through time (called backpropagation through time, BPTT). This makes it look like a feed-forward DNN in which each layer corresponds to a single time step and the weights at every layer are identical. Training then looks the same as for a DNN, because we have a target output for each time step. We can compute the gradient of the loss at each step (the external gradients) and the gradients of the parameters at each time step (the internal gradients); summing across time yields a single gradient vector for updating the network parameters given a sequence. If the sequence is particularly long (thousands of time steps), the network is sometimes only unrolled for a limited number of steps (truncated backpropagation through time), although the forward pass still runs over the entire sequence.
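The external/internal gradient bookkeeping can be made concrete with a scalar toy RNN (my own minimal example, not from the paper): h_t = tanh(w*h_{t-1} + u*x_t) with squared-error loss at every step.

```python
import numpy as np

def bptt(w, u, xs, ys):
    """BPTT for the scalar RNN h_t = tanh(w*h_{t-1} + u*x_t) with loss
    L = sum_t (h_t - y_t)^2; returns (L, dL/dw, dL/du)."""
    # Forward pass: unroll through time, storing every hidden state.
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    loss = sum((h - y) ** 2 for h, y in zip(hs[1:], ys))

    # Backward pass: at each step, combine the external gradient (from
    # that step's loss) with the gradient carried back from later steps,
    # then accumulate the internal parameter gradients across time.
    dw = du = dh_next = 0.0
    for t in range(len(xs), 0, -1):
        dh = 2 * (hs[t] - ys[t - 1]) + dh_next  # external + carried gradient
        da = dh * (1 - hs[t] ** 2)              # back through tanh
        dw += da * hs[t - 1]                    # internal gradient for w
        du += da * xs[t - 1]                    # internal gradient for u
        dh_next = da * w                        # pass gradient to step t-1
    return loss, dw, du

xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.2, -0.4, 0.1, 0.5]
loss, dw, du = bptt(0.7, 0.3, xs, ys)
```

The summed `dw` and `du` agree with finite-difference estimates of dL/dw and dL/du, which is a standard sanity check for a BPTT implementation. Truncated BPTT would simply stop the backward loop after a fixed number of steps.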
Early on, it was found that even though RNNs have the capability to model long-term dependencies, it is very difficult to learn them using backpropagation through time: because the weight matrices are applied repeatedly at each time step, the gradients can vanish or explode over time depending on the eigenvalues of the weight matrices. This effectively limits their practical time span to 5 or 10 steps. To mitigate this problem, Hochreiter and Schmidhuber proposed long short-term memory (LSTM) recurrent neural networks, whose units include an internal cell with multiplicative gates that can store state for an arbitrary period of time. Each unit has gates controlling how much of the input enters the cell, how much the cell retains its contents, and how much of the cell contents is fed to the output. The value of each gate lies between zero and one, and is determined by mixing the input, the previous output, and the cell contents and passing the result through a logistic sigmoid function. Feeding the cell contents into the gates is called adding "peephole" connections; this helps with precise timing. In practice, the cell state can be retained through time for as long as the forget gate allows. This lets LSTMs outperform standard RNNs at learning context-free and context-sensitive languages, phonetic labeling, online and offline handwriting recognition, etc. LSTM layers can be stacked arbitrarily deeply to obtain hierarchical representations.
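A sketch of one LSTM step with peephole connections (illustrative NumPy code; the parameter names and the use of diagonal peephole weights are my own conventions, though diagonal peepholes match the common formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. The peephole vectors w_ci, w_cf, w_co let the
    gates see the cell contents; each gate value lies in (0, 1)."""
    i = sigmoid(p['W_xi'] @ x + p['W_hi'] @ h_prev + p['w_ci'] * c_prev + p['b_i'])  # input gate
    f = sigmoid(p['W_xf'] @ x + p['W_hf'] @ h_prev + p['w_cf'] * c_prev + p['b_f'])  # forget gate
    g = np.tanh(p['W_xc'] @ x + p['W_hc'] @ h_prev + p['b_c'])                       # candidate input
    c = f * c_prev + i * g                        # cell retains (f) and admits (i)
    o = sigmoid(p['W_xo'] @ x + p['W_ho'] @ h_prev + p['w_co'] * c + p['b_o'])       # output gate
    h = o * np.tanh(c)                            # gated cell contents become the output
    return h, c

rng = np.random.default_rng(1)
n_in, n_cell = 8, 16
p = {}
for name in ['i', 'f', 'c', 'o']:
    p['W_x' + name] = rng.normal(scale=0.1, size=(n_cell, n_in))
    p['W_h' + name] = rng.normal(scale=0.1, size=(n_cell, n_cell))
    p['b_' + name] = np.zeros(n_cell)
for name in ['i', 'f', 'o']:
    p['w_c' + name] = rng.normal(scale=0.1, size=n_cell)

h, c = np.zeros(n_cell), np.zeros(n_cell)
h, c = lstm_step(rng.normal(size=n_in), h, c, p)
```

Note that when `f` saturates near 1 and `i` near 0, the line `c = f * c_prev + i * g` copies the cell state forward unchanged, which is exactly the mechanism that lets gradients flow across long spans.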
One drawback is that, for a given number of hidden units, an LSTM has roughly four times as many parameters as a standard RNN. To mitigate this, a projected LSTM layer was proposed, which applies a linear projection after each layer. This makes the hidden state of the network that is recurrently recycled smaller in dimensionality. As before, projected LSTM layers can be stacked to form deep networks.
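A minimal sketch of the projection step (my own naming; `W_rm` is the projection matrix, and the 1024/512 sizes mirror the architecture discussed later):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstmp_output(o, c, W_rm):
    """Projected LSTM output: the gated cell output m (n_cell-dim) is
    linearly projected to r (n_proj-dim), and r, not m, is what gets
    recycled as the recurrent state and fed to the next layer."""
    m = o * np.tanh(c)   # ordinary LSTM output, n_cell-dimensional
    r = W_rm @ m         # linear projection down to n_proj dimensions
    return r

rng = np.random.default_rng(2)
n_cell, n_proj = 1024, 512
W_rm = rng.normal(scale=0.02, size=(n_proj, n_cell))
r = lstmp_output(sigmoid(rng.normal(size=n_cell)), rng.normal(size=n_cell), W_rm)
```

The saving comes from the recurrent matrices: each of the four `n_cell x n_cell` recurrent blocks shrinks to `n_cell x n_proj`, at the cost of one extra `n_proj x n_cell` projection matrix.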
At Google, asynchronous stochastic gradient descent (ASGD) is used, which distributes the gradient computation across servers (instead of, e.g., across cores on a GPU). The individual servers hold subsets of the weights, compute gradients for their weights, and communicate with a central parameter server to update them. Each machine runs processes on multiple threads, and each thread processes multiple sequences at once, so many thousands of sequences can be processed in parallel. Truncated backpropagation through time is used with around 20 time steps, and cell activations are clipped to the range [-50, 50]. In the end, three forms of asynchrony are present compared to standard gradient descent: different parameter values are used at different points within each utterance, each model replica is updated independently, and the parameter server is updated independently.
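The flavor of this asynchrony can be illustrated with a toy Hogwild-style sketch (entirely my own construction, not Google's distributed system): several threads apply SGD updates to one shared parameter vector without any locking, so each thread routinely computes gradients against slightly stale parameters, yet the least-squares problem still converges.

```python
import threading
import numpy as np

# Toy problem: recover w_true from noiseless linear observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true

w = np.zeros(10)  # shared parameters (stand-in for the parameter server)

def worker(seed, steps=200, lr=0.05, batch=32):
    """One asynchronous replica: sample a mini-batch, compute a gradient
    against whatever parameter values happen to be current, and apply
    the update in place with no synchronization."""
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = local_rng.integers(0, len(X), size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w[:] -= lr * grad  # unsynchronized update; other threads interleave

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
loss = np.mean((X @ w - y) ** 2)
```

Despite lost and interleaved updates, the final loss is far below the initial one; the clipping of activations mentioned above serves a similar robustness role in the real system, bounding what any one stale update can do.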
This system was applied to Google Voice Search in US English. Google has 1900 hours of anonymized, hand-transcribed utterances, resulting in 600 million frames of 40-dimensional log-filterbank features. The network is trained to predict HMM states using a cross-entropy error, where a 3-state HMM is used with 14,000 total context-dependent states. The labels are aligned to the speech transcriptions using a DNN Viterbi forced alignment. In order to give the network future context, a 5-frame output delay is used (predict the state 5 frames later). To evaluate performance, a standard speech recognition decoder is used to convert frame posteriors to words. The silence prior is downweighted because so much of the input data is silence. Finally, a huge language model is used to choose the most likely utterance. A variety of network architectures were evaluated; the best-performing network without projection had 5 layers of 440 cells with 13 million total parameters, resulting in a 67.6% frame accuracy on the development set with a 10.8% word error rate. The best-performing network with projection had three layers of 1024 cells with a projection to 512 hidden states and 20 million total parameters, resulting in a 69.3% frame accuracy with a 10.7% word error rate.
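The reported parameter counts can be roughly reproduced by direct accounting (the exact treatment of biases, peepholes, and the 14,000-way output layer here is my own assumption, so the totals are approximate, but both round to the figures above):

```python
def lstm_params(n_in, n_cell, peephole=True):
    """One plain LSTM layer: four blocks (three gates + candidate input),
    each with an input matrix, a recurrent matrix, and a bias, plus
    three diagonal peephole vectors."""
    n = 4 * (n_in * n_cell + n_cell * n_cell + n_cell)
    return n + (3 * n_cell if peephole else 0)

def lstmp_params(n_in, n_cell, n_proj, peephole=True):
    """One projected LSTM layer: the recurrence runs through the
    n_proj-dim projected state, plus the projection matrix itself."""
    n = 4 * (n_in * n_cell + n_proj * n_cell + n_cell) + n_cell * n_proj
    return n + (3 * n_cell if peephole else 0)

n_states = 14000  # context-dependent HMM states

# 5 layers of 440 cells on 40-dim features, plus the softmax output layer:
plain = lstm_params(40, 440) + 4 * lstm_params(440, 440) \
        + (440 * n_states + n_states)

# 3 layers of 1024 cells projected to 512, plus the output layer:
proj = lstmp_params(40, 1024, 512) + 2 * lstmp_params(512, 1024, 512) \
       + (512 * n_states + n_states)
```

This gives roughly 13.2 million parameters for the plain architecture and 19.4 million for the projected one, consistent with the "13 million" and "20 million" figures. It also shows where the budget goes: nearly half of each model is the output layer onto 14,000 states.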
To improve the accuracy of speech recognition systems, "sequence training" is often used, which adopts an alternative criterion that better approximates the word error rate (rather than the frame-wise cross entropy). In practice, this is done with state-level minimum Bayes risk (sMBR), which penalizes the errors made when decoding the phonemes with the language model. This makes the calculation of the outer derivatives more complicated, because it requires decoding the entire utterance. The choice of language model affects the results: a bigram language model performs best, reducing the word error rate from 10.7% (with plain cross-entropy) to 10%. Using different schemes for aligning the labels to phonemes did not change the results much. In practice, the training criterion is switched from cross-entropy to state-level minimum Bayes risk after training has converged on cross-entropy; the point at which this switch is made can affect the results.