
# Towards End-To-End Speech Recognition

- **Presenter:** Tara Sainath
- **Context:** Columbia Neural Network Reading Group/Seminar Series
- **Date:** 9/16/15

How can we replace different parts of the speech recognition pipeline with neural networks? Speech recognition can be formulated as finding the most likely word sequence $W$ given an audio waveform $X$, i.e. $W^\star = \mathrm{argmax}_W p(W | X) = \mathrm{argmax}_W p(X | W) p(W)$, which factors the problem into an acoustic model $p(X | W)$ (mapping from waveform to words) and a language model $p(W)$ (prior probabilities of word sequences). A typical pipeline first extracts features from the audio and maps them to individual states, then converts those states to phonemes, and finally to words and sentences. The acoustic model portion comprises the feature extraction and the mapping to phoneme states; this talk focuses on simplifying the acoustic model.

The most common features used with neural networks are log-mel filterbank features, which are computed over windowed segments of the audio waveform. Ideally, this feature transformation could be learned directly, potentially yielding a better representation. Once features are computed, they must be converted to states; GMMs were traditionally used for this mapping, but neural networks now produce state-of-the-art results. The acoustic model maps acoustic features to phone states, but the number of states becomes huge once triphones (phonemes with left and right context) are considered, so the state inventory is typically compressed. Ideally, this portion of the model could also be replaced with a learning-based approach.
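To make the feature pipeline concrete, here is a minimal numpy sketch of a standard log-mel filterbank computation. It is not the implementation from the talk; the 25 ms frame length, 10 ms hop, 512-point FFT, and 40 mel filters are common illustrative defaults, not values the talk specified.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(waveform, sample_rate=16000, frame_len=400,
                     hop=160, n_fft=512, n_mels=40):
    """Log-mel filterbank features over windowed segments of a waveform.

    Frame length 400 samples (25 ms) and hop 160 (10 ms) at 16 kHz are
    common choices, used here only for illustration.
    """
    # Slice the waveform into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(waveform) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hanning(frame_len)

    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (n_frames, n_fft//2+1)

    # Triangular mel filterbank: filter centers equally spaced in mel space.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Apply the filterbank and compress the dynamic range with a log.
    return np.log(power @ fbank.T + 1e-10)  # (n_frames, n_mels)
```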

## Connectionist Temporal Classification (CTC)

When training a neural network acoustic model, an existing model is usually used to align the training speech waveforms to the word labels in order to get per-frame labels. Two ways this was typically done were Viterbi alignment, which assigns a hard label to each frame, and Baum-Welch, which assigns soft labels. Connectionist Temporal Classification (CTC) circumvents this by learning the alignment jointly while training the acoustic model: whereas training with cross-entropy requires a label prediction at every frame, CTC adds a “blank” label so that only some frames carry a predicted label. After outputting these labels with blanks, repeated labels without separating blanks are merged.

Cross-entropy tries to maximize the correct class at every single frame under the fixed frame-level alignment; in CTC, a “lattice” $z^l$ is instead defined which encodes all possible alignments of the input $x$ with the label sequence $l$. The total probability $p(z^l | x)$ is computed with the forward-backward algorithm, and its error is backpropagated. This tends to produce models with sharper label transitions. The resulting approach can improve word error rates over using context-dependent states, particularly with a bidirectional model (only LSTM models were tested, not CLDNN models). Future work will combine raw-waveform CLDNNs with CTC.
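As a concrete illustration of the CTC pieces described above, here is a small numpy sketch (my own, not from the talk) of the collapsing function that merges repeats and removes blanks, plus a naive forward pass that sums the probability of every alignment in the lattice $z^l$. The blank index and names are arbitrary choices, and a real implementation would work in log space for numerical stability.

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (a convention assumed here)

def collapse(path):
    """CTC collapsing function B: merge repeated labels, then drop blanks."""
    merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [p for p in merged if p != BLANK]

def ctc_forward(probs, label):
    """Total probability p(l | x), summed over all alignments in the lattice.

    probs: (T, V) array of per-frame label posteriors; label: the target
    sequence without blanks. Naive (unstable) version for exposition only.
    """
    # Interleave blanks with the labels: blank, l1, blank, l2, ..., blank.
    ext = [BLANK]
    for c in label:
        ext += [c, BLANK]
    S, T = len(ext), len(probs)

    # alpha[t, s] = probability of all alignment prefixes ending at ext[s]
    # after frame t (the forward half of forward-backward).
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]          # stay on the same lattice state
            if s > 0:
                a += alpha[t - 1, s - 1]  # advance by one state
            # Skipping a blank is allowed unless it would merge repeats.
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]

    # Valid alignments end on the last label or the final blank.
    return alpha[-1, -1] + alpha[-1, -2]
```

For example, `collapse([1, 1, 0, 1, 2, 2])` (with `0` as blank) returns `[1, 1, 2]`: the first two `1`s merge, while the separating blank keeps the third `1` as a distinct label.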