{{tag>"Talk Summaries" "Neural Networks" "NIPS 2015"}}

====== New Territory of Machine Translation ======

| Presenter | Kyunghyun Cho |
| Context | NIPS 2015 Reasoning, Attention, and Memory Workshop |
| Date | 12/12/15 |
The path to natural language understanding is not one-dimensional. Currently, machine translation works by taking one sentence at a time, segmenting it into words, tokenizing the words, feeding the tokens into a machine translation engine, and getting back a sequence of target words, which are detokenized, desegmented, and turned into a sentence. The same process is repeated for every sentence in the paragraph. We can call this word-level, sentence-wise, bilingual translation. However, not all languages are naturally a sequence of words, translating sentence by sentence suffers from not knowing the content of the paragraph, and treating translation as a bilingual problem does not exploit the structure shared by all languages.
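A minimal sketch of this per-sentence pipeline is given below; every helper function in it is a hypothetical placeholder rather than a real MT API.

<code python>
# A rough sketch of word-level, sentence-wise, bilingual translation.
# All helpers (split_sentences, segment, tokenize, translate_tokens,
# detokenize, desegment) are hypothetical placeholders for a real MT engine.
def translate_paragraph(paragraph, split_sentences, segment, tokenize,
                        translate_tokens, detokenize, desegment):
    target_sentences = []
    for sentence in split_sentences(paragraph):          # one sentence at a time
        source_tokens = tokenize(segment(sentence))
        target_tokens = translate_tokens(source_tokens)  # the MT engine proper
        target_sentences.append(desegment(detokenize(target_tokens)))
    # Each sentence is translated independently of the others, which is
    # exactly the "sentence-wise" limitation discussed above.
    return " ".join(target_sentences)
</code>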
===== Neural MT =====
Neural machine translation describes sequence-to-sequence (encoder-decoder) models that probabilistically generate a target sequence given the source sequence. This probability is encapsulated in a neural network trained to maximize the log likelihood over a training set. Optionally, the network includes an alignment (attention) model, which helps avoid the need for very large models. This differs from existing approaches (phrase-based and rule-based) in that it is a single model trained end-to-end. However, it still suffers from the same three issues: word-level, sentence-wise, and bilingual.
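Concretely, given training pairs $(x^{(n)}, y^{(n)})$ of source and target sentences, the standard maximum-likelihood objective is $\hat{\theta} = \arg\max_{\theta} \sum_{n} \log p(y^{(n)} \mid x^{(n)}; \theta)$, where the decoder factorizes the conditional probability as $p(y \mid x; \theta) = \prod_{t} p(y_t \mid y_{<t}, x; \theta)$, generating the target one symbol at a time conditioned on the previously generated symbols and on the encoder's representation of the source.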
===== Beyond Word-Level =====
To resolve the word-level translation issue, we can go down to subword (character) level translation. Working at the subword level can handle languages where a word is made up of subwords whose order changes from word to word. It can also capture the fact that many words (run, ran, running) share a root or meaning. Some languages (e.g. Chinese) also have no spaces, so segmentation can be difficult. Finally, it would be useful to be able to handle typos. So, instead of segmentation and tokenization, we can go directly to the character level. The nice thing about neural networks is that they are trainable end-to-end, so they are flexible to a change in representation. One issue is that the relationship between the way a word is spelled and its meaning is highly nonlinear (e.g. quiet, quite, quit). The computational complexity also increases greatly, because the sequence length becomes much longer, which causes a linear increase in complexity. One way to mitigate this is to treat each word as a sequence of characters, which is converted into an intermediate representation by a network capable of modeling very complex functions. This has been applied on the encoder side, not the decoder side. More recently, however, it was shown that the same idea (generating words from sequences of characters) can be used to generate the target sequence. However, this still requires segmentation into words. Furthermore, some characters are compositional, which would be useful structure to exploit.
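A minimal sketch of such a character-to-word encoder is given below: a small network turns each word's character sequence into an intermediate word vector. The architecture and sizes are illustrative assumptions, not necessarily those used in the work described.

<code python>
# Minimal sketch (PyTorch): build a word representation from its characters,
# so the model can handle morphology, misspellings, and unsegmented text.
# The layer sizes and the use of a GRU are illustrative assumptions.
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    def __init__(self, n_chars, char_dim=64, word_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        # A recurrent network reads the character sequence and emits one
        # vector per word: the "intermediate representation" of the word.
        self.rnn = nn.GRU(char_dim, word_dim, batch_first=True)

    def forward(self, char_ids):          # char_ids: (batch, chars_per_word)
        chars = self.embed(char_ids)      # (batch, chars_per_word, char_dim)
        _, h = self.rnn(chars)            # h: (1, batch, word_dim)
        return h.squeeze(0)               # one vector per word

# Example: the word "running" as a sequence of (ASCII) character ids.
enc = CharWordEncoder(n_chars=128)
word_vec = enc(torch.tensor([[ord(c) for c in "running"]]))
print(word_vec.shape)  # torch.Size([1, 256])
</code>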
===== Beyond Sentence-Wise =====
The information that a paragraph or document has a certain overall topic can be useful when translating; it can allow us to narrow down which words to use. It can also be crucial for question-answering tasks. Some languages also do not have a future tense; without context it can be impossible to know when to use the future tense in the target language. Simply conditioning a neural language model on preceding sentences did not work well, because obtaining the context is not straightforward and neither is using it. However, modifying the LSTM network to fuse the context information at a later stage produced better and better perplexity as the context was increased. In particular, nouns, adjectives, and adverbs became more predictable. In general, it is not clear how to incorporate context information into MT systems. Ideally we could use a lot of context, but this could become very inefficient, so something like a hierarchical model may be beneficial. Some work on hierarchical structure has been done in dialogue modeling, with limited success.
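One way to read "fusing the context at a later stage" is sketched below: a context vector summarizing the preceding sentences is combined with the LSTM state just before the output layer rather than being fed in as input. This is an illustrative variant, not necessarily the exact model from the talk.

<code python>
# Sketch of late fusion: context from preceding sentences is merged with the
# LSTM state right before the output softmax. Sizes and the additive fusion
# are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionLM(nn.Module):
    def __init__(self, vocab, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.ctx_proj = nn.Linear(dim, dim)   # projects the context vector
        self.out = nn.Linear(dim, vocab)

    def forward(self, word_ids, context_vec):
        h, _ = self.lstm(self.embed(word_ids))        # (batch, T, dim)
        ctx = torch.tanh(self.ctx_proj(context_vec))  # (batch, dim)
        fused = h + ctx.unsqueeze(1)                  # late fusion per step
        return self.out(fused)                        # next-word logits

# context_vec could be, e.g., an average of the word embeddings of the
# preceding sentences in the paragraph.
lm = LateFusionLM(vocab=10000)
logits = lm(torch.randint(0, 10000, (1, 7)), torch.zeros(1, 256))
print(logits.shape)  # torch.Size([1, 7, 10000])
</code>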
===== Beyond Bilingual =====
We would expect there to be some knowledge transfer across languages, e.g. knowing one language may help when learning a new language. Some people think this is beneficial for human learners and some do not, but we can test it on machines. Ideally, given $N$ source languages and $M$ target languages, we would have a single model for all $N \times M$ language pairs, but we only have training data for certain language pairs. In the case of sequence-to-sequence translation this is straightforward: any language can be encoded into the same shared vector space, and any language can be decoded from that space. This produces a kind of language-agnostic sentence vector with some success, but it is not straightforwardly applicable to an attention mechanism, because there may not be a single alignment across all languages/modalities. To verify this, an experiment was run translating from English, German, and Finnish into English and German, with one attention mechanism shared across all language pairs; this provided a small improvement for Finnish, which is a low-resource language. It was also verified that the multilingual model generalizes and converges better. Ideally, we can extend multilingual to multimodal and multitask learning too.
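The routing behind this idea can be sketched in a few lines: $N$ language-specific encoders and $M$ language-specific decoders cover all $N \times M$ directions because every encoder maps into the same shared space. The per-language callables below are hypothetical stand-ins for the sequence-to-sequence components described above, and the single shared vector stands in for the shared attention mechanism.

<code python>
# Sketch of multi-way translation: N encoders + M decoders cover N x M
# directions. The encoder/decoder callables are hypothetical stand-ins for
# the per-language sequence-to-sequence networks.
class MultiWayMT:
    def __init__(self, encoders, decoders):
        self.encoders = encoders   # e.g. {"en": ..., "de": ..., "fi": ...}
        self.decoders = decoders   # e.g. {"en": ..., "de": ...}

    def translate(self, sentence, src, tgt):
        shared_vec = self.encoders[src](sentence)   # language-agnostic vector
        return self.decoders[tgt](shared_vec)       # generate in target language

# Three encoders and two decoders give 3 x 2 = 6 directions, so a
# low-resource language like Finnish shares parameters with the others.
</code>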
===== In General =====
Many of these issues (single modality, ignoring the past, ignoring additional/unsupervised data) are true of supervised learning in general, which may mean that machine translation and NLP are a good test bed for AI. For example, moving from bilingual to multilingual is like going to multimodal settings, and environment understanding is like larger-context language processing.