{{tag>"Talk Summaries" "Neural Networks" "NIPS 2015"}}
====== New Territory of Machine Translation ======

| Presenter | Kyunghyun Cho |
| Context | NIPS 2015 Reasoning, Attention, and Memory Workshop |
| Date | 12/12/15 |

The path to natural language understanding is not one-dimensional. Currently, machine translation works by taking one sentence at a time, segmenting it into words, tokenizing the words, feeding the tokens into a machine translation engine, and getting back a sequence of target words, which are detokenized, desegmented, and turned into a target sentence. The same process is repeated for every sentence in the paragraph. We can call this word-level, sentence-wise, bilingual translation. However, not all languages are naturally a sequence of words, translating sentence by sentence suffers from not knowing the content of the surrounding paragraph, and treating each language pair separately does not exploit the structure shared by all languages.
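As a rough illustration of the word-level, sentence-wise, bilingual pipeline described above, here is a minimal sketch; ''segment'', ''tokenize'', ''translate_tokens'', and ''detokenize'' are hypothetical placeholders rather than any particular system.

<code python>
def translate_paragraph(paragraph, segment, tokenize, translate_tokens, detokenize):
    """Word-level, sentence-wise, bilingual translation pipeline (sketch)."""
    target_sentences = []
    # Each sentence is handled in isolation, so no paragraph-level context is used.
    for sentence in segment(paragraph):
        source_tokens = tokenize(sentence)               # word-level representation
        target_tokens = translate_tokens(source_tokens)  # one fixed source -> target pair
        target_sentences.append(detokenize(target_tokens))
    return " ".join(target_sentences)
</code>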
===== Neural MT =====

Neural machine translation refers to sequence-to-sequence/encoder-decoder models which probabilistically generate a target sequence given the source sequence. This probability is parametrized by a neural network trained to maximize the log likelihood over a training set. Optionally, the network includes an alignment model, which lets the decoder refer back to the source sequence and can avoid the need for very large models. This is different from existing (phrase- and rule-based) approaches because it is a single model trained end-to-end. However, it still has the same three issues: it is word-level, sentence-wise, and bilingual.
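To make the training setup concrete, here is the standard encoder-decoder objective (stated for reference under the usual autoregressive factorization; the talk itself is not tied to this exact notation). The target sentence $y = (y_1, \ldots, y_T)$ is generated given the source $x$, and the parameters $\theta$ are fit by maximum likelihood over the training set $\mathcal{D}$:

$$ p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x), \qquad \hat{\theta} = \arg\max_\theta \sum_{(x, y) \in \mathcal{D}} \log p_\theta(y \mid x). $$

In this notation, the optional alignment model lets each factor $p_\theta(y_t \mid y_{<t}, x)$ attend over per-position encodings of $x$ rather than a single fixed-length summary vector.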
===== Beyond Character-Level =====

To resolve the word-level issue, we can go down to subword (character) level translation. Working below the word level can handle languages where a word is made up of subwords whose order changes from word to word. It also captures the fact that many words (run, ran, running) share a root or meaning. Some languages (e.g. Chinese) also have no spaces, so segmentation can be difficult. Finally, it would be useful to be able to handle typos. So, instead of segmentation and tokenization, we can go directly to the character level. The nice thing about neural networks is that they are trainable end-to-end, so they are flexible to a change in representation. One issue is that the relationship between how a word is spelled and what it means is highly nonlinear (e.g. quiet, quite, quit). The computational cost also grows substantially because the sequences become much longer, and complexity grows at least linearly with sequence length. One way to mitigate this is to treat each word as a sequence of characters which is converted into an intermediate word-level representation by a network capable of modeling very complex functions. This had been applied on the encoder side but not the decoder side; more recently, it was shown that the same idea (generating words from sequences of characters) can be used to generate the target sequence as well. However, this still requires segmentation into words. Furthermore, some characters are themselves compositional, which would be useful structure to exploit.
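As a sketch of the character-to-word encoding described above, the following module builds a word vector from the word's character sequence; the layer sizes and the choice of a GRU are illustrative assumptions, not the specific architecture from the talk.

<code python>
import torch
import torch.nn as nn

class CharToWordEncoder(nn.Module):
    """Maps a word's character sequence to an intermediate word-level vector (sketch)."""
    def __init__(self, n_chars=128, char_dim=32, word_dim=256):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.char_rnn = nn.GRU(char_dim, word_dim, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_length) integer character indices
        embedded = self.char_embed(char_ids)       # (num_words, length, char_dim)
        _, final_state = self.char_rnn(embedded)   # (1, num_words, word_dim)
        return final_state.squeeze(0)              # one vector per word

# The resulting word vectors can be fed to an otherwise ordinary word-level
# encoder-decoder, so only the input representation changes.
char_ids = torch.randint(0, 128, (5, 12))          # 5 words of up to 12 characters
word_vectors = CharToWordEncoder()(char_ids)       # shape (5, 256)
</code>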
===== Beyond Sentence-Wise =====

The information that a paragraph or document has a certain overall topic can be useful when translating; it can allow us to narrow down which words to use. It can also be crucial for question answering tasks. Some languages do not have a future tense, so without context it can be impossible to know that the future tense should be used in the target language. Simply conditioning a neural language model on preceding sentences did not work well, because neither obtaining the context nor using it is straightforward. However, modifying the LSTM network to fuse the context information at a later stage produced better and better perplexity as more context was included. In particular, nouns, adjectives, and adverbs became more predictable. In general, it is not clear how best to incorporate context information into MT systems. Ideally we could use a lot of context, but this could become very inefficient, so something like a hierarchical model may be beneficial. Some work on hierarchical structure has been done for dialogue modeling, with limited success.
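One plausible reading of fusing the context at a later stage is late fusion: a summary vector of the preceding sentences is combined with the LSTM's output just before predicting the next word, rather than being fed in with the input words. The sketch below illustrates that idea; the wiring, sizes, and additive fusion are assumptions, not the exact model from the talk.

<code python>
import torch
import torch.nn as nn

class LateFusionLM(nn.Module):
    """Language model that injects a context vector after the LSTM (sketch)."""
    def __init__(self, vocab_size=10000, word_dim=256, hidden_dim=512, context_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.context_proj = nn.Linear(context_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, context_vector):
        # tokens: (batch, seq_len) word indices for the current sentence
        # context_vector: (batch, context_dim) summary of preceding sentences,
        # e.g. a bag-of-words or averaged sentence vector
        hidden, _ = self.lstm(self.embed(tokens))                  # (batch, seq_len, hidden_dim)
        context = self.context_proj(context_vector).unsqueeze(1)   # (batch, 1, hidden_dim)
        fused = hidden + context                                   # fuse context after the LSTM
        return self.out(fused)                                     # next-word logits
</code>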
===== Beyond Bilingual =====

We would expect some knowledge transfer across languages; for example, knowing one language may help in learning a new one. Some people think this is beneficial for human learners and some do not, but we can test it on machines. Ideally, given $N$ source languages and $M$ target languages, we would have a single model covering all $N \times M$ language pairs, even though training data is only available for individual language pairs. In the sequence-to-sequence setting this is straightforward: any source language can be encoded into the same shared vector space, and any target language can be decoded from that space. This produces a kind of language-agnostic sentence vector with some success, but it is not straightforwardly applicable to an attention mechanism, because there may not be a single alignment shared across all languages/modalities. To test this, an experiment was run translating English, German, and Finnish into English and German with one attention mechanism shared across all language pairs, which provided a small improvement for Finnish, a low-resource language. It was also observed that the multi-language model generalizes and converges better. Ideally, multilingual learning could be extended to multimodal and multitask learning too.

===== In General =====

Many of these issues (a single modality, ignoring the past, ignoring additional/unsupervised data) apply to supervised learning in general, which may mean that machine translation and NLP are a good test bed for AI. For example, moving from bilingual to multilingual is like moving to multimodal settings, and environment understanding is like larger-context language processing.