|Context|NIPS 2015 Multimodal Machine Learning Workshop|
Given an image, to understand it we can try to tag it and to produce a natural language description of it. Building a system that does this automatically is something of a holy grail of computer vision. A natural approach is to encode sentences and images into the same space, which is analogous to translating one language into another; in practice this means learning a joint embedding of images and text.

One way to do this is to encode sentences with a recurrent neural network and images with a convolutional network. Both are differentiable, so both can be optimized with respect to a joint objective function. A suitable choice is a ranking objective, which requires that matching images and sentences lie close together in the embedded space while non-matching pairs lie far apart. Once everything is embedded, a query image can be used for nearest-neighbor retrieval to obtain a caption. The same machinery supports tagging, and searching by tag (the reverse direction); training with this one objective makes all of these possible essentially for free.

A better approach than retrieval is to generate sentences, inspired by neural language models, which use the previous words to predict the next word as a conditional probability. When conditioning on images, it can be more effective to represent each word as a slice of a tensor that is gated by attributes of the image, so that the image determines what the word embeddings should be. The result is that image features gate the hidden-to-output connections in the language model used for predicting the next word, which makes it very effective at generating image captions. The model can be further augmented with an attention mechanism, which allows it to “focus” on different parts of the image when generating different words.
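The ranking objective and nearest-neighbor retrieval described above can be sketched in a few lines. This is a minimal NumPy version under stated assumptions: cosine similarity as the scoring function, a hinge loss with an illustrative margin of 0.1, and precomputed embedding matrices whose rows are aligned (row i of the image embeddings matches row i of the sentence embeddings); none of these specifics come from the notes themselves.

```python
import numpy as np

def ranking_loss(img_emb, sen_emb, margin=0.1):
    """Pairwise hinge ranking loss: a matching image/sentence pair (same row)
    must score at least `margin` higher than every mismatched pair."""
    # Cosine similarity matrix: scores[i, j] = sim(image i, sentence j)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sen = sen_emb / np.linalg.norm(sen_emb, axis=1, keepdims=True)
    scores = img @ sen.T
    pos = np.diag(scores)  # scores of the matching pairs
    # Hinge in both directions: rank sentences per image, and images per sentence
    cost_s = np.maximum(0.0, margin + scores - pos[:, None])
    cost_i = np.maximum(0.0, margin + scores - pos[None, :])
    np.fill_diagonal(cost_s, 0.0)  # matching pairs incur no cost
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

def retrieve_caption(query_img_emb, sen_emb):
    """Nearest-neighbor caption retrieval: return the index of the sentence
    embedding closest (by cosine similarity) to the query image embedding."""
    sims = sen_emb @ query_img_emb / (
        np.linalg.norm(sen_emb, axis=1) * np.linalg.norm(query_img_emb))
    return int(np.argmax(sims))
```

Minimizing this loss pulls matching pairs together and pushes mismatched pairs apart, which is exactly what makes the nearest-neighbor retrieval step meaningful.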
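The gating idea above can also be made concrete: each word is represented by a slice of a tensor, and multiplying that tensor by the image-feature vector yields image-conditioned word embeddings, so the image effectively gates the hidden-to-output connections. The sketch below is an illustrative NumPy reduction of that idea; the dimensions, the random tensor, and the function names are all assumptions for demonstration, not part of the original notes.

```python
import numpy as np

# Hypothetical sizes for illustration: vocabulary, embedding dim, image-feature dim
V, D, K = 5, 4, 3

rng = np.random.default_rng(0)
# Word-representation tensor: one D x K slice per vocabulary word.
word_tensor = rng.standard_normal((V, D, K))

def gated_embeddings(word_tensor, img_feat):
    """Collapse each word's D x K slice against the image-feature vector,
    producing word embeddings that depend on the image (shape V x D)."""
    return word_tensor @ img_feat

def next_word_logits(hidden, word_tensor, img_feat):
    """Image features gate the hidden-to-output connections: the logits over
    the vocabulary depend on both the language-model state and the image."""
    E = gated_embeddings(word_tensor, img_feat)  # image-conditioned embeddings
    return E @ hidden                            # one logit per vocabulary word
```

A softmax over these logits would give the conditional next-word distribution; with a different image, the same hidden state yields different logits, which is the point of the gating.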
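The attention mechanism mentioned at the end can be sketched as soft attention over image regions: at each generation step the decoder state scores every region, a softmax turns the scores into weights, and the weighted average of region features is what the model "focuses" on. This is a minimal NumPy sketch assuming dot-product scoring; real models typically learn the scoring function.

```python
import numpy as np

def soft_attention(region_feats, query):
    """Soft attention over image regions.

    region_feats: (R, D) array, one feature vector per image region.
    query: (D,) decoder state at the current word-generation step.
    Returns the attended context vector and the attention weights.
    """
    scores = region_feats @ query              # (R,) alignment scores
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ region_feats           # weighted average region feature
    return context, weights
```

Because the weights are recomputed from the decoder state at every step, different words attend to different parts of the image.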