Context: Columbia Neural Network Reading Group/Seminar Series
The goal is to create a parametric generative model of natural images. Most existing techniques could only generate small, simple images and/or low-quality images. The basic idea is to initially focus on modeling a low-resolution image space, then build a sequence of conditional image models which iteratively improve the generated image, all using the generative adversarial network framework.
Generative modeling can be framed as follows: you have access to samples from some data distribution in the form of a finite training set, and you want to learn a generative model of that data. There are many ways to quantify the “goodness” of the model, including how well samples from the model resemble entries in the training set and/or whether the training data has high likelihood under the model. Generative models can be used as part of a representation learning framework (e.g. encoder/decoder), in unsupervised learning tasks, or in density estimation to learn the structure of the data. As of a few months ago, most generative models of natural images could produce realistic samples only for small, simple images such as 32×32 CIFAR.
Rather than maximizing likelihood directly, generative adversarial networks attempt to learn how to draw good samples by defining two networks and training them in opposition to one another. The generative model maps from a prior noise distribution to the data space, and a discriminative model tries to decide whether an image is real or fake. The loss function is defined so that the generative model improves at fooling the discriminator. The generator has access to the gradients of the discriminator, which tell it how to change its output to look more real. There are many heuristics for choosing the capacity of the generator and discriminator, the learning rate, and the number of steps to train the discriminator vs. the generator. Getting these heuristics right is crucial to training the system successfully. Conditional generative adversarial networks condition the generative model on an additional piece of information, e.g. a class label. The discriminative model is also given this information.
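For reference, the two-network game described above is usually written as the minimax objective below. This is the standard formulation from the original GAN paper rather than something stated in the talk; the symbols are assumptions introduced here: G is the generator, D the discriminator, z the noise drawn from the prior p_z, x a real image, and y the extra conditioning information (such as a class label).

```latex
% Standard GAN objective: D maximizes, G minimizes.
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% Conditional variant: both networks also receive y.
\min_G \max_D \;
  \mathbb{E}_{(x, y) \sim p_{\mathrm{data}}}\big[\log D(x, y)\big]
  + \mathbb{E}_{z \sim p_z,\, y}\big[\log\big(1 - D(G(z, y), y)\big)\big]
```

The first term rewards the discriminator for accepting real images, and the second rewards it for rejecting generated ones; the generator only influences the second term.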
The Laplacian pyramid is an invertible multi-scale image representation, which downsamples an image by repeated factors of two and stores the residual that is lost at each step in the pyramid. The residual is the difference between the image at one step and a Gaussian-upsampled version of the next, downsampled step. The result is one very low-resolution image, plus a sequence of residuals. To apply conditional generative adversarial networks to this representation, a separate network at each scale tries to generate the residual conditioned on the image at that scale. At each scale of the pyramid, an example of a “real” input is the true residual for the downsampled version; the generative network takes in the downsampled version and then produces an estimate of the residual. The discriminator takes in both the high-frequency (residual) image and the downsampled image. The GAN at each step is trained independently. Sampling from the model is akin to reconstructing from the Laplacian pyramid: first a low-resolution image is generated from noise (as in a normal GAN), then this image is passed to the next GAN, which produces a residual used to compute a higher-resolution image, and so on. The whole model was not trained end-to-end because this would make it so that each low-resolution ground-truth image only mapped to one high-resolution image; training each stage independently allows for exploring different possible outputs given one low-resolution input. Fine-tuning would be possible, but wasn't done.
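The pyramid construction and its inverse can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, 2×2 average pooling stands in for the blur-and-decimate step, and pixel repetition stands in for the Gaussian upsampling. Reconstruction is exact whichever operators are used, because each residual is defined as whatever the upsampled image is missing.

```python
import numpy as np

def downsample(img):
    """Halve height and width of an H x W x C image by averaging 2x2 blocks.
    Assumption: a simple stand-in for Gaussian blur followed by decimation."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def upsample(img):
    """Double height and width by pixel repetition.
    Assumption: a simple stand-in for Gaussian (smoothed) upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(img, levels):
    """Return the coarsest image and the residuals, ordered fine to coarse."""
    residuals = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        # The residual is what is lost by downsampling and upsampling again.
        residuals.append(current - upsample(low))
        current = low
    return current, residuals

def reconstruct(coarse, residuals):
    """Invert build_pyramid: upsample and add residuals, coarse to fine."""
    current = coarse
    for res in reversed(residuals):
        current = upsample(current) + res
    return current

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # a CIFAR-sized random "image"
coarse, residuals = build_pyramid(image, levels=2)
restored = reconstruct(coarse, residuals)
```

In the model described above, each entry of `residuals` would be a “real” training example for the GAN at that scale, and `reconstruct` mirrors the sampling procedure with generated residuals in place of true ones.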
On the CIFAR-10 dataset (32×32, 10 classes), each generative network took in 4 channels (three channels of color plus one channel of noise) and produced a residual image. Each discriminator was trained on the residual plus the down/upsampled image, plus a one-hot indicator of the class label. Minibatches consisted of a random sampling of real images and a random sampling of fake images. Some of the resulting images looked realistic, while some looked a bit blurry; this could be due to errors propagating from one stage to the next. Nearest neighbor analysis in pixel space, feature space, and a transformation space indicated that the model was not learning a lookup table/memorizing the input images. In a human evaluation, humans classified about 90% of real images as real for reasonable presentation times, but classified only about 40% of generated images as real. The original GAN produced images which were classified as real only about 10% of the time.
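The 4-channel generator input and the coarse-to-fine sampling chain can be sketched as follows. Everything here is illustrative: `placeholder_generator` is a hypothetical stand-in for a trained convolutional network, the uniform noise range is an assumption, pixel repetition stands in for the upsampling operator, and the class-label conditioning is left out for brevity.

```python
import numpy as np

def upsample(img):
    """Double height and width by pixel repetition (assumed upsampling operator)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def make_generator_input(upsampled, rng):
    """Stack an H x W x 3 image with one H x W plane of noise, giving H x W x 4."""
    noise = rng.uniform(-1.0, 1.0, size=upsampled.shape[:2] + (1,))
    return np.concatenate([upsampled, noise], axis=-1)

def placeholder_generator(x):
    """Hypothetical stand-in for a trained network: 4 channels in, 3-channel
    residual out. Here it just spreads scaled noise over the color channels."""
    return 0.1 * x[..., 3:4] * np.ones(3)

def sample(coarse, generators, rng):
    """Refine a coarse image stage by stage: upsample, then add a generated residual."""
    current = coarse
    for generator in generators:
        up = upsample(current)
        current = up + generator(make_generator_input(up, rng))
    return current

coarse = np.random.default_rng(0).random((8, 8, 3))   # would come from the first GAN
generators = [placeholder_generator, placeholder_generator]
sample_a = sample(coarse, generators, np.random.default_rng(1))
sample_b = sample(coarse, generators, np.random.default_rng(2))
```

Because fresh noise enters at every stage, the same coarse image yields different high-resolution samples, which is the behavior the independent training of each stage is meant to preserve.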
The second experiment was on the LSUN dataset (10 million 64×64 images, 10 classes). A larger convolutional network was used, whose structure was chosen with cross-validation. The first or second stage was pretty crucial: it chose where to put objects, which were simply refined at later stages. The resulting images were realistic-looking. In general, GANs really try not to place density where there is none in the input space, unlike, say, an L2 loss, which may do a lot of smoothing. Another way to confirm that the network isn't overfitting is to feed in the same initial low-resolution image and sample multiple times to produce multiple different images.
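The smoothing behavior of an L2 loss can be seen in a toy case that is not from the talk: suppose one low-resolution input is equally compatible with two sharp high-resolution outputs. The single prediction that minimizes expected squared error is their average, which is blurry and matches neither. The arrays and names below are made up for illustration.

```python
import numpy as np

# Two equally plausible outputs for the same input: a row of pixels with a
# sharp edge at one of two positions.
mode_a = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
mode_b = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])
targets = np.stack([mode_a, mode_b])

def expected_l2(prediction):
    """Squared error of one fixed prediction, averaged over both plausible targets."""
    return np.mean(np.sum((targets - prediction) ** 2, axis=1))

# The minimizer of expected squared error is the per-pixel mean of the targets.
l2_optimal = targets.mean(axis=0)

loss_blurry = expected_l2(l2_optimal)   # averaged, soft-edged prediction
loss_sharp = expected_l2(mode_a)        # committing to one realistic output
```

Here `l2_optimal` has intermediate values of 0.5 where the two modes disagree, yet its expected loss is lower than that of either sharp mode. An adversarial loss instead penalizes such an in-between output, since the discriminator never sees it among the real data.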
More recently, Radford et al. wrote up a number of tricks (e.g. batch normalization, greater depth) which help train GANs. These tricks are orthogonal to the LAPGAN method, so the two could be combined to produce even nicer images. These results could also be extended to an autoencoder framework to do unsupervised learning. One key area for exploration is making GAN models easier to train.