|Context|NIPS 2015 Reasoning, Attention, and Memory Workshop|
The path to natural language understanding is not one-dimensional. Currently, machine translation works by taking one sentence at a time, segmenting it into words, tokenizing the words, feeding the words into a machine translation engine, then getting back a sequence of target words, which are detokenized, desegmented, and turned into a sentence. The same process repeats for every sentence in the paragraph. We can call this word-level, sentence-wise, bilingual translation. However, not all languages are a sequence of words; translating sentence-by-sentence suffers from ignoring the content of the surrounding paragraph; and treating translation as a bilingual problem does not exploit the structure shared across all languages.
Neural machine translation describes sequence-to-sequence/encoder-decoder models which seek to probabilistically generate a target sequence given the source sequence. This probability is encapsulated in a neural network trained to maximize the log likelihood over a training set. Optionally, the network includes an alignment (attention) model, which can avoid the need for very large models. This differs from existing approaches (phrase- and rule-based) in that it is a single model trained end-to-end. However, it still has the same three issues (word-level, sentence-wise, bilingual).
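The idea above can be sketched as a toy: an "encoder" collapses the source tokens into a vector, and a "decoder" assigns a log probability to each target token given that vector. This is a minimal illustration of the training objective only, with random stand-in parameters and made-up vocabularies, not a real NMT system.

```python
# Toy encoder-decoder log-likelihood: p(target | source) factorized over
# target tokens, which training would maximize over a corpus.
import math
import random

random.seed(0)

SRC_VOCAB = ["das", "ist", "gut"]   # made-up example vocabularies
TGT_VOCAB = ["this", "is", "good"]
DIM = 4

# Random embeddings / output weights stand in for learned parameters.
src_emb = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in SRC_VOCAB}
tgt_out = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in TGT_VOCAB}

def encode(src_tokens):
    """Collapse the source sentence into one fixed-size vector (mean pooling)."""
    vecs = [src_emb[w] for w in src_tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def log_prob(tgt_tokens, src_tokens):
    """log p(target | source): sum of per-token log-softmax scores."""
    h = encode(src_tokens)
    total = 0.0
    for w in tgt_tokens:
        scores = {t: sum(a * b for a, b in zip(tgt_out[t], h)) for t in TGT_VOCAB}
        z = sum(math.exp(s) for s in scores.values())
        total += scores[w] - math.log(z)
    return total

ll = log_prob(["this", "is", "good"], ["das", "ist", "gut"])
```

Training would adjust the embeddings and output weights to push this log-likelihood up; a real system would of course use a recurrent (or attentional) encoder and decoder rather than mean pooling.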
To resolve the word-level translation issue, we can go down to subword (character) level translation. Working at the subword level can handle languages where a word is made up of subwords whose order changes based on the word. It can also capture the fact that many words (run, ran, running) share a root or meaning. Some languages (e.g. Chinese) also have no spaces, so segmentation can be difficult. Finally, it would be useful to be able to handle typos. So, instead of segmentation and tokenization, we can go directly to the character level. The nice thing about neural networks is that they are end-to-end trainable, so they are flexible to changes in representation. One issue is that the relationship between the way a word is spelled and its meaning is highly nonlinear (e.g. quiet, quite, quit). The computational complexity also increases greatly because the sequence length becomes much longer, which causes a corresponding increase in complexity. One way to mitigate this is to treat each word as a sequence of characters, which is converted to an intermediate representation by a network capable of modeling very complex functions. This has been applied to the encoder side, but not the decoder. More recently, however, it was shown that you can use the same approach (generating words from sequences of characters) to generate the target sequence. However, this still requires segmentation into words. Furthermore, some characters are compositional, which would be useful structure to exploit.
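A minimal sketch of the character-level idea, under assumed stand-in functions (the deterministic `char_vec` pseudo-embedding and the position-weighted composition are illustrative inventions, not the talk's actual model): each word is just its character sequence, composed into a word vector, so related forms share structure and unseen strings (including typos) still get a representation.

```python
# Character-level word representations: compose character "embeddings"
# into a word vector instead of looking the word up in a fixed table.
import math

DIM = 8

def char_vec(c):
    """Deterministic pseudo-embedding for a character (stands in for learned params)."""
    return [math.sin(ord(c) * (i + 1)) for i in range(DIM)]

def word_vec(word):
    """Compose characters into a word vector with decaying position weights."""
    out = [0.0] * DIM
    for pos, c in enumerate(word):
        weight = 1.0 / (1 + pos)          # earlier characters weigh more
        cv = char_vec(c)
        out = [o + weight * v for o, v in zip(out, cv)]
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "run" and "running" share their dominant leading characters, so their
# composed vectors overlap; a word-level lookup table would treat them
# as unrelated entries. A typo like "runnning" still yields a vector.
sim_related = cosine(word_vec("run"), word_vec("running"))
sim_unrelated = cosine(word_vec("run"), word_vec("table"))
typo_vector = word_vec("runnning")
```

A real model would replace the weighted sum with a learned recurrent or convolutional composition network, precisely because the spelling-to-meaning map is highly nonlinear.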
The information that a paragraph or document has a certain overall topic can be useful when translating; it can allow us to narrow down which words we might use. It can also be crucial for question answering tasks. Some languages also do not have a future tense; without context it can be impossible to know to use the future tense. Conditioning a neural language model on preceding sentences didn't work well, because neither extracting the context nor using it is straightforward. However, modifying the LSTM network to fuse the context information at a later stage produced steadily better perplexity as the context was increased. In particular, nouns, adjectives, and adverbs became more predictable. In general, it's not clear how to incorporate context information into MT systems. Ideally we could use a lot of context, but this could become very inefficient, so something like a hierarchical model may be beneficial. Some work on hierarchical structure has been done on dialogue modeling, with some limited success.
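The "late fusion" idea can be sketched as follows. This is a hypothetical toy, not the paper's architecture: a bag-of-words context vector from the preceding sentences is combined with the model's output scores only at the final layer, biasing the next-word distribution toward topical words.

```python
# Late fusion sketch: combine context with the hidden state at the
# output layer rather than feeding it through the recurrence.
import math
from collections import Counter

VOCAB = ["the", "cat", "sat", "mat", "dog", "ran"]   # toy vocabulary

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(preceding_sentences):
    """Bag-of-words frequencies over the preceding sentences."""
    counts = Counter(w for s in preceding_sentences for w in s.split())
    total = sum(counts.values())
    return [counts[w] / total for w in VOCAB]

def next_word_dist(hidden_scores, ctx, fusion_weight=2.0):
    """Late fusion: add a context-derived bonus to the model's raw scores."""
    fused = [h + fusion_weight * c for h, c in zip(hidden_scores, ctx)]
    return softmax(fused)

hidden = [0.1, 0.5, 0.0, 0.2, 0.5, 0.0]   # pretend LSTM output scores
ctx = context_vector(["the cat sat", "the cat ran"])
dist = next_word_dist(hidden, ctx)
```

With "cat" and "dog" tied in the raw scores, the context (which mentions cats, not dogs) breaks the tie, which is the effect reported for content words like nouns and adjectives.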
We would expect that there is some knowledge transfer across languages, e.g. if we know one language it may help to learn a new language. Some people think this is beneficial for human learners, some don't, but we can test this on machines. Ideally, given $N$ source languages and $M$ target languages, we can have a single model for all $N \times M$ language pairs, but we only have training data for some language pairs. In the case of sequence-to-sequence translation, it's straightforward - you can encode any language into the same shared vector space, and decode to any language from that space. This produces some kind of language-agnostic sentence vector with some success, but is not straightforwardly applicable to an attention mechanism because there may not be a single alignment across all languages/modalities. To verify this, an experiment was run on English, German, and Finnish to English and German, with one shared attention mechanism between all language pairs, which provided a small improvement for Finnish, a low-resource language. It was also verified that the multilingual model generalizes and converges better. Ideally, we can extend multilingual to multimodal and multitask learning too.
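The shared-space arrangement can be sketched structurally (all names and the hash-based "encoder" are illustrative stand-ins, not the experiment's code): one encoder per source language and one decoder per target language, all meeting in a shared vector space, so $N + M$ modules cover all $N \times M$ directions.

```python
# N encoders + M decoders through a shared space cover N x M pairs.
DIM = 4

def make_encoder(lang_scale):
    """Stand-in for a learned encoder mapping tokens into the shared space."""
    def encode(tokens):
        vec = [0.0] * DIM
        for t in tokens:
            code = sum(map(ord, t))                 # crude deterministic hash
            for i in range(DIM):
                vec[i] += lang_scale * ((code >> i) % 5)
        return vec
    return encode

def make_decoder(lang_name):
    """Stand-in for a learned decoder reading only the shared vector."""
    def decode(shared_vec):
        return f"<{lang_name} sentence decoded from {len(shared_vec)}-dim vector>"
    return decode

SOURCES = {"en": make_encoder(1.0), "de": make_encoder(0.9), "fi": make_encoder(1.1)}
TARGETS = {"en": make_decoder("en"), "de": make_decoder("de")}

def translate(src_lang, tgt_lang, tokens):
    shared = SOURCES[src_lang](tokens)   # any source -> shared space
    return TARGETS[tgt_lang](shared)     # shared space -> any target

# 3 encoders + 2 decoders cover all 3 x 2 = 6 translation directions.
pairs = [(s, t) for s in SOURCES for t in TARGETS]
result = translate("fi", "en", ["hei"])
```

The point the structure makes is that adding one new language costs one encoder and/or one decoder, not a model per pair; what it deliberately omits is the attention mechanism, since (as noted above) a single shared alignment across all languages is the hard part.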
Many of these issues (single modality, ignoring the past, ignoring additional/unsupervised data) apply to supervised learning in general, which may mean that machine translation and NLP are a good test bed for AI. For example, moving from bilingual to multilingual is like moving to multimodal settings, and environment understanding is like larger-context language processing.