{{tag>"Talk Summaries" "Neural Networks" "NIPS 2015"}}

====== Multimodal Deep Learning ======

| Presenter | Russ Salakhutdinov |
| Context | NIPS 2015 Multimodal Machine Learning Workshop |
| Date | 12/11/15 |

Given an image, understanding it means being able to do things like tag it and produce a natural language description; building a system which does this automatically is something of a holy grail of computer vision. A natural approach is to encode sentences and images into the same space, analogous to translating one language into another. In practice this is done by learning a joint embedding of images and text: a recurrent neural network encodes sentences and a convolutional network encodes images, and since both are differentiable they can be optimized with respect to a joint objective function. One suitable choice is a ranking objective, which requires that matching images and sentences lie close together in the embedded space while non-matching images and sentences lie far apart (a minimal sketch of this objective is given below). Once embedded, a query image can be used for nearest-neighbor retrieval to obtain a caption. The same embedding supports tagging and its inverse, search by tag; training the ranking objective makes these capabilities fall out essentially for free.

A better approach than retrieval is to generate sentences, inspired by neural language models, which use the previous words to predict the next word as a conditional probability. When conditioning on images, it can be more effective to represent the word embeddings as a tensor gated by attributes of the image, so that the image determines what the word embeddings should be. This amounts to letting image features gate the hidden-to-output connections in the language model used to predict the next word, which makes it very effective at generating captions for images.

The model can be further augmented with an attention mechanism, which allows it to "focus" on different parts of the image while generating different words.
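As a concrete illustration, here is a minimal NumPy sketch of the pairwise ranking objective described above. The talk does not specify an implementation, so all names and shapes here are assumptions made for this example: each row of ''image_emb'' and ''sent_emb'' is taken to be a matching image/sentence pair produced by the CNN and RNN encoders, and the other rows of the batch serve as contrastive negatives.

<code python>
# Illustrative sketch (not from the talk) of a hinge ranking objective:
# matching image/sentence pairs should score higher than mismatched
# pairs by at least a margin.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project vectors onto the unit sphere so cosine similarity = dot product."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def ranking_loss(image_emb, sent_emb, margin=0.2):
    """Hinge ranking loss over a batch of matching (image_i, sentence_i) pairs.

    image_emb, sent_emb: (batch, dim) arrays; row i of each is a matching pair.
    Every other row in the batch acts as a negative example.
    """
    image_emb = l2_normalize(image_emb)
    sent_emb = l2_normalize(sent_emb)
    scores = image_emb @ sent_emb.T           # (batch, batch) similarity matrix
    positives = np.diag(scores)               # similarities of matching pairs
    # Cost for ranking a wrong sentence above the true one, and vice versa.
    cost_sent = np.maximum(0.0, margin + scores - positives[:, None])
    cost_img = np.maximum(0.0, margin + scores - positives[None, :])
    # A pair is not its own negative, so zero out the diagonal.
    mask = 1.0 - np.eye(scores.shape[0])
    return ((cost_sent + cost_img) * mask).sum() / scores.shape[0]
</code>

With encoders trained so that this loss is low, captioning reduces to nearest-neighbor lookup: embed the query image and return the sentence whose embedding scores highest against it. Tagging and search-by-tag work the same way in the opposite direction.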
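The attention step can be sketched in the same spirit. What follows is one common formulation, additive (Bahdanau-style) attention over CNN region features; the talk does not commit to this exact scoring function, and the parameter names ''W_r'', ''W_s'', and ''v'' are hypothetical.

<code python>
# Illustrative sketch of one soft-attention step: the decoder state is
# scored against each image region, and the softmax weights say where
# the model "focuses" when predicting the next word.
import numpy as np

def soft_attention(regions, state, W_r, W_s, v):
    """One attention step over CNN region features.

    regions: (num_regions, region_dim) feature vectors, one per image location
    state:   (state_dim,) current hidden state of the caption decoder
    W_r:     (region_dim, hidden_dim), W_s: (state_dim, hidden_dim), v: (hidden_dim,)
    Returns (context, weights): the attended feature vector and the
    per-region focus weights.
    """
    # Additive scoring of each region against the decoder state.
    scores = np.tanh(regions @ W_r + state @ W_s) @ v   # (num_regions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over regions
    context = weights @ regions                         # (region_dim,)
    return context, weights
</code>

Because the weights are recomputed from the decoder state at every step, the model attends to different image regions for different words, which is exactly the "focusing" behavior described above.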