|Context||Columbia Neural Network Reading Group/Seminar Series|
In 1949, Warren Weaver wrote a memorandum called “Translation”, which arguably predicted everything that happened in machine translation over the following 60 years. In 1991-1993, IBM proposed “statistical machine translation”, the approach used by Google Translate and many other systems. Statistical machine translation approximates the conditional distribution of a translation given a source sentence. Bayes' rule separates this conditional probability into a “translation model” and a “language model”. In practice, statistical machine translation is messy. To approximate the conditional distribution, people use a log-linear model. Because the model itself is linear, the most important work becomes designing good features: the function being approximated is nonlinear, and the feature extraction must capture that nonlinearity. Nowadays there are sometimes upwards of 100 features in a statistical machine translation system, and most of the progress has come from designing a better feature function. Finally, a strong external language model is used to filter out unlikely or ungrammatical sentences.
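The log-linear scoring described above can be sketched in a few lines. This is a minimal illustration, not a real SMT system: the feature names, values, and weights below are hypothetical (real systems use dozens of features with tuned weights), but the scoring rule, a weighted sum of feature values followed by an argmax over candidates, is the same.

```python
def log_linear_score(features, weights):
    """Score a candidate translation as a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for two candidate translations of one source
# sentence. "translation_model" and "language_model" stand in for the two
# log-probabilities from the Bayes-rule decomposition.
weights = {"translation_model": 1.0, "language_model": 0.5, "length_penalty": -0.2}
candidates = {
    "the cat sat": {"translation_model": -2.1, "language_model": -1.3, "length_penalty": 3.0},
    "cat the sat": {"translation_model": -2.0, "language_model": -4.8, "length_penalty": 3.0},
}

# The ungrammatical candidate is penalized by the language model feature.
best = max(candidates, key=lambda c: log_linear_score(candidates[c], weights))
print(best)  # -> the cat sat
```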
Instead of extracting features by hand and using a log-linear model, the entire problem can be treated as a supervised learning problem using, for example, neural networks. Interestingly, this was first proposed in 1997, but at the time the neural nets needed to be very large and it was therefore computationally prohibitive. Recently, this method was revived by Cho, Kalchbrenner, Sutskever, and others. The basic idea is to use an “encoder” neural network to read the source sentence, one word at a time. Then, based on this encoding, a decoder converts the state into a new sentence. The words are fed into the network as a 1-of-K encoding, which encodes essentially no knowledge about the sentence - all words are equidistant from one another; there is no structure. Each one-hot vector is multiplied by a projection matrix into a continuous space, where the projection is learned. The encoder network then computes a hidden state at each time step based on the previous hidden state and the current projected word. Finally, the decoder predicts each word based on the previously output word and the state of the encoder network after processing the entire sequence. The probability of the next word is computed as in a language model, and the output sentence is found with beam search. Interestingly, during training this model learns to predict (garbage) sentences of the correct length first. The word embedding matrix also yields different embeddings depending on the task it is trained on.
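The encoder described above can be sketched as a simple recurrence. This is an illustrative NumPy sketch with a toy vocabulary and random (untrained) weights, not the actual model from the talk: each one-hot vector is projected into a continuous space, and the hidden state is updated from the previous state and the current projected word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "the", "cat", "sat"]    # toy vocabulary (illustrative)
V, d = len(vocab), 8                      # vocabulary size, embedding/state size

E = rng.normal(scale=0.1, size=(d, V))    # learned projection (embedding) matrix
W_h = rng.normal(scale=0.1, size=(d, d))  # recurrent weights
W_x = rng.normal(scale=0.1, size=(d, d))  # input weights

def one_hot(i):
    """1-of-K encoding: no structure, all words equidistant."""
    v = np.zeros(V)
    v[i] = 1.0
    return v

def encode(token_ids):
    """Read the source one word at a time, updating the hidden state."""
    h = np.zeros(d)
    for i in token_ids:
        x = E @ one_hot(i)              # project one-hot word into continuous space
        h = np.tanh(W_h @ h + W_x @ x)  # new state from previous state + current word
    return h

summary = encode([1, 2, 3])             # "the cat sat"
print(summary.shape)                    # -> (8,)
```

The decoder would run a similar recurrence in reverse, predicting one word at a time conditioned on this final state.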
This model is very unrealistic, because it requires that the model read the entire sentence and then predict a translation without “looking” at the source sentence again. In other words, “you can’t cram the meaning of a whole sentence in a single vector”. Instead, a better approach is to allow the decoder access to the entire source sentence while decoding. This can be achieved by computing a sequence of attention weights, which measure how important each state in the source sentence is, computed from the current context using a learned transformation. So, instead of a single state serving as the representation of the source sentence, a weighted average of the encoder states is used. This model works very well, and also allows for visualization of the attention weights, which shows what the model is “focusing on” when it predicts a given word. Attention allows the model to base its prediction of a word on a source word that appeared earlier, or one that appears later in the sentence. The attention-based model produces large accuracy gains over a model that encodes the entire source sentence in a fixed-length vector.
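The weighted average at the heart of the attention mechanism can be sketched directly. In this NumPy sketch the encoder states and decoder state are random placeholders, and a plain dot product stands in for the learned transformation that scores each source position in the actual model; what matters is that the scores are softmax-normalized into weights that sum to one, and the context vector is the corresponding weighted average of encoder states.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 8                       # source sentence length, state size
H = rng.normal(size=(T, d))       # encoder states, one per source word (placeholder)
s = rng.normal(size=d)            # current decoder state (placeholder)

# Score each encoder state against the decoder state. A dot product stands in
# here for the learned transformation used in the real model.
scores = H @ s

# Softmax-normalize the scores into attention weights that sum to 1.
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()

# Represent the source as a weighted average of encoder states rather than a
# single fixed-length vector.
context = alphas @ H
print(context.shape)              # -> (8,)
```

Plotting `alphas` for each output word is exactly the attention visualization mentioned above.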
One issue with this model is that the computational complexity grows linearly with the size of the output vocabulary, which can be prohibitively expensive when the vocabulary reaches millions of words. Many approximate methods sacrifice normalization of the probability, so an approximate, biased sampling procedure was proposed instead. The approximate vocabulary is chosen from subsets of the training set at training time, and at test time it is built from the words aligned to the test sentences. The resulting system achieved the best results on a benchmark task for some language pairs. A way to further improve these results is to segment words into short character n-grams, which can then be translated at the subword level. This can be extended further to the character level, but the transformation must then be very nonlinear.
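The vocabulary problem can be made concrete with a sketch. Below, a full softmax over a large output vocabulary is compared with a softmax restricted to a small candidate shortlist; the shortlist here is chosen at random purely for illustration, standing in for the alignment-based vocabulary selection described above, and the weights are random placeholders rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
full_V, small_V, d = 50_000, 500, 8

W = rng.normal(scale=0.01, size=(full_V, d))  # output word representations (placeholder)
h = rng.normal(size=d)                        # decoder state (placeholder)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Full softmax: the matrix-vector product and normalization both cost O(|V|),
# i.e. linear in the output vocabulary size.
p_full = softmax(W @ h)

# Approximate softmax: normalize only over a small candidate vocabulary.
# A random shortlist stands in for the alignment-based selection.
shortlist = rng.choice(full_V, size=small_V, replace=False)
p_short = softmax(W[shortlist] @ h)

print(p_full.shape, p_short.shape)            # -> (50000,) (500,)
```

Restricting normalization to the shortlist is what makes the resulting probabilities approximate (and, depending on how candidates are drawn, biased).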
A simple extension would be multiple input languages and multiple output languages, which would allow for a universal language space. Also, most work is done on sentence-to-sentence translation; to extend beyond this, the state after a sentence can depend on the state after the previous sentence. Finally, similar models can be used for a variety of tasks - any that can be seen as transducing sequences. For example, image caption generation can be accomplished by replacing the encoder with a convolutional network, and the attention weights can again be plotted, showing which part of the image is being attended to when the model predicts a given word. Other tasks that have shown promising results in the attention framework include video caption generation, speech recognition, and more.