|Context||NIPS 2015 Reasoning, Attention, and Memory Workshop|
Compared to other generative models of images, which start from noise, captions provide additional information about the image, which narrows down the number of choices the model must make. The cost is that the model must also learn a good language model. In return, it can still produce novel compositions when given unusual captions. The general idea is to treat the problem as machine translation, where the caption is translated into the image. The language model is a standard bidirectional LSTM, and the image model is a variational recurrent autoencoder with visual attention, which lets it determine “where” in the image to generate. The model is trained to maximize the variational lower bound, which consists of a reconstruction term (as in autoencoders) and a KL-divergence term between the inferred distribution and a prior distribution. When computing attention over the caption (alignment), the weights are computed from both the state of the image generation model and the state of the language model. Training with a simple reconstruction cost often yields blurry images (perhaps due to the weakness of the generative model or the size of the dataset), so a generative adversarial network is used to sharpen the generated images rather than relying on the reconstruction cost alone.
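The alignment step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes an additive (tanh-based) scoring function followed by a softmax, and the names `align`, `U`, `W`, `v`, `b`, and all dimensions are hypothetical choices made for the example. The only property taken from the text is that the weights depend on both the language-model states and the state of the image generation model.

```python
import numpy as np

def align(h_lang, h_gen_prev, U, W, v, b):
    """Return attention weights over caption words and the context vector.

    h_lang:     (N, d_lang) language-model states, one per caption word
    h_gen_prev: (d_gen,)    previous state of the image generation model
    U, W, v, b: hypothetical learned parameters of an additive scoring function
    """
    # Score each word from both its own state and the generator's state.
    scores = np.tanh(h_lang @ U.T + h_gen_prev @ W.T + b) @ v  # shape (N,)
    # Softmax, shifted by the max for numerical stability.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: weighted average of the word representations.
    context = weights @ h_lang  # shape (d_lang,)
    return weights, context

# Toy usage with random states and parameters (all sizes are arbitrary).
rng = np.random.default_rng(0)
N, d_lang, d_gen, d_att = 5, 8, 6, 4
h_lang = rng.normal(size=(N, d_lang))
h_gen_prev = rng.normal(size=d_gen)
U = rng.normal(size=(d_att, d_lang))
W = rng.normal(size=(d_att, d_gen))
v = rng.normal(size=d_att)
b = np.zeros(d_att)

weights, context = align(h_lang, h_gen_prev, U, W, v, b)
```

Because the generator state changes at every drawing step, the weights (and therefore the words the model focuses on) can differ from step to step. In this sketch, inspecting `weights` is what it would mean to look at which caption words were used most strongly.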
The model is trained on the Microsoft COCO dataset. One way to evaluate it is to take an existing caption and modify it slightly (for example, changing the colors or objects it mentions). Changing colors usually achieved the desired transformation, but changing objects did not always produce a discernible difference. A qualitative picture of the model's behavior can be obtained by looking at which words in the caption were weighted most strongly when generating the image. For quantitative analysis, the simplest measure is to show that the log-likelihood differs little between the training and test sets, and that it varies only slightly across different language models. The model produced blurrier images than the Laplacian Pyramid Generative Adversarial Network model, but they were arguably closer to what was requested in the caption.