|Presenter||Bart van Merrienboer|
|Context||Columbia Neural Network Reading Group/Seminar Series|
To model translation probabilistically, the goal is to find the translation in the target language that maximizes a probability conditioned on the sentence in the source language. Part of this probability involves the language model, which computes the probability of a word given the words seen previously. The language model can be approximated as an n-gram model, where we only condition on the previous n-1 words. Neural networks, including RNNs, have been used as language models for a long time. The state of the art currently uses ensembles of RNNs. The translation model, the other part of the probability, is more difficult to estimate. Usually it is modeled as a big log-linear model based on a huge ensemble of features, including aligned n-gram co-occurrence counts. Then, to find a target sentence, you do a massive beam search over all possible combinations of n-grams. This can be seen as a “direct” approach, where you are simply trying to translate more or less word-for-word. This causes issues because of locality: morphology, case, and syntactic relationships often depend on words outside the local window. An ideal approach would be to use some kind of interlingua representation, where the source sentence is mapped to some sort of “understanding” and the target sentence is generated from that.
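As a rough illustration of the n-gram approximation, here is a minimal sketch of a bigram (n = 2) language model estimated from counts. The toy corpus, the function names, and the `<s>`/`</s>` boundary tokens are assumptions made for this example; real systems add smoothing so that unseen n-grams do not get zero probability.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts

def bigram_prob(word, prev, bigram_counts, context_counts):
    """Maximum-likelihood estimate of p(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(tokens, bigram_counts, context_counts):
    """Probability of a sentence as a product of bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= bigram_prob(word, prev, bigram_counts, context_counts)
    return prob

# Toy corpus (hypothetical data, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
bigrams, contexts = train_bigram_lm(corpus)
p_cat_given_the = bigram_prob("cat", "the", bigrams, contexts)
p_sentence = sentence_prob(["the", "cat", "sat"], bigrams, contexts)
```

Here "cat" follows "the" in two of the three sentences, so the estimate of p(cat | the) is 2/3.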
Encoder/decoder architectures try to do this “ideal” thing: the source sentence is encoded as a fixed-length vector, and that vector is then decoded into a target sentence. To get this to work, more sophisticated RNNs with a gating mechanism (GRU or LSTM) must be used. Many additional connections must also be made, between the encoded representation, the previously output word, the previous decoder state, etc. It's also critical to do beam search. These models are large and can take a long time to train. The resulting model worked well, but above a certain sentence length the performance dropped, most likely because the encoding vector acts as a “bottleneck”: it is required to summarize the entire sentence. Unseen words also caused issues - this is less of a problem for traditional methods, which can just use a large look-up table. Google proposed a very large model which worked well on sentences up to typical length. Reversing the input sentence helped a lot, because the decoder then outputs the start of the sentence shortly after the encoder has seen it.
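Since beam search is singled out as critical, a minimal sketch of it may help. This is not the decoder from the talk: `next_token_log_probs` is a hypothetical callback standing in for one decoder step, and `toy_decoder` is a made-up lookup table chosen so that the greedy choice is not the best overall.

```python
import math

def beam_search(next_token_log_probs, beam_size, max_len, eos="</s>"):
    """Keep the `beam_size` best partial hypotheses at every step.

    `next_token_log_probs(prefix)` returns {token: log_prob} for the token
    that follows `prefix` (a list of tokens).
    """
    beams = [(0.0, [])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, log_p in next_token_log_probs(tokens).items():
                candidates.append((score + log_p, tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_len without eos
    return max(finished, key=lambda c: c[0])

def toy_decoder(prefix):
    """Hypothetical next-token distribution, for illustration only."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

best_score, best_tokens = beam_search(toy_decoder, beam_size=2, max_len=5)
greedy_score, greedy_tokens = beam_search(toy_decoder, beam_size=1, max_len=5)
```

With a beam of 2 the search recovers the sequence starting with "b" (probability 0.36), while a beam of 1, which is greedy decoding, commits to "a" and ends with probability 0.33.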
U Montreal came up with a different model which uses “attention”. The source sentence is fed into a bidirectional recurrent network, and at each decoding time step a weight vector is computed which essentially denotes which of the hidden states of the bidirectional RNN to “pay attention” to. The weight vector is computed from the encoder states and the current decoder state using a small feedforward network. This makes the model perform much better on longer sentences.
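The attention step can be sketched in a few lines of plain Python. The additive scoring form used here, `v . tanh(W_dec s + W_enc h_j)`, is one common choice for the small feedforward network and is an assumption of this example, as are all names and the toy numbers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (weights, context) for one decoder time step.

    Each encoder state h_j gets the score v . tanh(W_dec s + W_enc h_j);
    the context is the weighted average of the encoder states.
    """
    proj_dec = [dot(row, decoder_state) for row in W_dec]
    scores = []
    for h in encoder_states:
        proj_enc = [dot(row, h) for row in W_enc]
        hidden = [math.tanh(a + b) for a, b in zip(proj_dec, proj_enc)]
        scores.append(dot(v, hidden))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy values (hypothetical): three source positions, two-dimensional states.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context(
    [0.5, -0.5], encoder_states, identity, identity, [1.0, 1.0]
)
```

The weights are non-negative and sum to one, so the context vector is always a convex combination of the encoder states, recomputed at every decoder step instead of being fixed once for the whole sentence.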
The main computational cost of the network is in the softmax layer, because the number of possible target words is huge. A way to mitigate this is to use a softmax approximation, where you look ahead to see which words are going to be needed, use them as an embedding, and predict values in the embedded space. At test time you do a full softmax. Translation quality can be improved by utilizing a monolingual language model, where the extent to which it is used can also be learned. Also, the network is typically trained always knowing what the correct previous word is (teacher forcing), but this leaves it unable to know what to do when it produces the wrong word. An alternative is “selective sampling”, where once the network is trained to produce reliable alignments, the previously produced word can be used instead.
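The difference between teacher forcing and feeding back the model's own output can be shown with a toy decoder loop. The `step_table` below is a hypothetical stand-in for a trained network that makes one mistake (it predicts "dog" after "the"); nothing here is taken from the talk's actual models.

```python
def run_decoder(step_fn, gold, teacher_forcing):
    """Run a decoder for len(gold) steps and return its predictions.

    With teacher forcing the next input is the correct previous word;
    otherwise it is the word the model itself just produced.
    """
    outputs = []
    prev = "<s>"
    for gold_word in gold:
        predicted = step_fn(prev)
        outputs.append(predicted)
        prev = gold_word if teacher_forcing else predicted
    return outputs

# Hypothetical "model": maps the previous word to the next word.
step_table = {"<s>": "the", "the": "dog", "cat": "sat", "dog": "barked"}

def toy_step(prev):
    return step_table.get(prev, "<unk>")

gold = ["the", "cat", "sat"]
forced = run_decoder(toy_step, gold, teacher_forcing=True)
free_running = run_decoder(toy_step, gold, teacher_forcing=False)
```

Under teacher forcing the mistake stays local (`the dog sat`), but when the model consumes its own output the error carries into the following step (`the dog barked`). This is the train/test mismatch that feeding back the produced word during training is meant to reduce.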
A disadvantage of the attention approach is that you don't end up with fixed-length embeddings which represent sentences, so applying it to a many-language to many-language translation setting may not be feasible. However, capturing all of the information in a sentence in a language-invariant way requires a lot of modeling power. Another disadvantage is that for every word in the target sentence, you need to compute an attention over all words in the source sentence, which makes it very expensive to do things like document summarization.
Compared to older statistical machine translation methods, neural machine translation models tend to produce more “fluent” sounding sentences, because they are not “stitching together” phrases based on n-grams. However, the words which carry the most meaning are sometimes lower-frequency, and the neural machine translation model suffers on those. One method which might help would be to predict unknown words at the character level. A quantitative comparison between the two approaches can therefore be difficult, because different metrics emphasize different qualities.
Memory networks come from a different perspective, but are similar. They were designed to be able to reason. They use a neural network which maps an input into some representation, and then another network which decides where to put that representation in a “memory”. Then, there's a network which can read from the memory and decode it back into the desired representation. An end-to-end learnable formulation was proposed, which can be viewed as the attention structure, except that there is only one target at the output. The read is still iterated over multiple steps, which allows the network to refine its output and “reason”.
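The iterated read can be sketched as repeated soft attention over a memory. This is a simplified illustration under stated assumptions: dot-product scoring, separate key and value vectors per memory slot, and a query that is updated by adding the read vector. All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_hop(query, memory_keys, memory_values):
    """One soft read: attend over the keys, then add the read vector to the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in memory_keys]
    weights = softmax(scores)
    dim = len(memory_values[0])
    read = [
        sum(w * value[i] for w, value in zip(weights, memory_values))
        for i in range(dim)
    ]
    return weights, [q + r for q, r in zip(query, read)]

def multi_hop_read(query, memory_keys, memory_values, hops):
    """Apply `hops` reads in sequence, recording the weights of each hop."""
    trace = []
    for _ in range(hops):
        weights, query = memory_hop(query, memory_keys, memory_values)
        trace.append(weights)
    return query, trace

# Toy memory with three slots (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]
final_query, hop_weights = multi_hop_read([1.0, 0.0], keys, values, hops=2)
```

Because the query changes after each hop, the second read can focus on different memory slots than the first, which is the mechanism behind the multi-step refinement described above.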
Theano allows you to formulate a mathematical expression in Python code, and will then compile it to CUDA kernels or efficient C++ code. It also allows you to do symbolic differentiation. Blocks is a deep learning framework built on top of Theano. It consists of “bricks”, which are common operations in neural networks, like a linear transformation with a bias. Bricks annotate the computational graph automatically, which makes it easier to inspect and debug. Blocks also includes functionality to train networks, monitor results over batches, log experiments, run “main loops”, etc. It was designed with RNN capabilities in mind, so you should be able to build RNN models without difficulty, in addition to feedforward networks. It was also designed as a research tool, so there are monitoring and sanity checks as needed. Serialization tools exist which allow you to serialize the state of the model, training position, etc., so you can stop and resume at any point.
Fuel streamlines the process of reading data. It is able to read many different data inputs and allows you to iterate over and process the data in different ways. The loading/streaming/processing pipeline can be serialized at any point in time. As an application example, Fuel can read two text files, merge/zip them, and produce batches of sentences that are sorted according to their length.
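To make the application example concrete, here is a sketch of that pipeline in plain Python. It deliberately does not use Fuel's API; the function names, the chunking strategy (read several batches' worth of pairs, sort them by source length, then cut batches), and the in-memory "files" are all assumptions made for illustration.

```python
import io

def read_parallel(source_file, target_file):
    """Zip two line-aligned text streams into (source_tokens, target_tokens) pairs."""
    for src_line, tgt_line in zip(source_file, target_file):
        yield src_line.split(), tgt_line.split()

def cut_sorted_batches(chunk, batch_size):
    """Sort a chunk of pairs by source length and slice it into batches."""
    ordered = sorted(chunk, key=lambda pair: len(pair[0]))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]

def length_sorted_batches(pairs, batch_size, batches_per_chunk):
    """Buffer `batch_size * batches_per_chunk` pairs at a time and batch each chunk."""
    chunk = []
    for pair in pairs:
        chunk.append(pair)
        if len(chunk) == batch_size * batches_per_chunk:
            yield from cut_sorted_batches(chunk, batch_size)
            chunk = []
    if chunk:  # leftover pairs at the end of the data
        yield from cut_sorted_batches(chunk, batch_size)

# In-memory stand-ins for the two text files (hypothetical data).
source_text = io.StringIO("a b c d\na\na b c\na b\n")
target_text = io.StringIO("w x y z\nw\nw x y\nw x\n")
batches = list(
    length_sorted_batches(
        read_parallel(source_text, target_text), batch_size=2, batches_per_chunk=2
    )
)
source_lengths = [[len(src) for src, _ in batch] for batch in batches]
```

Grouping sentences of similar length into the same batch keeps padding to a minimum, which matters for RNN training where every sequence in a batch is padded to the longest one.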