|Context||McGill CIRMMT Special Talk|
Doug wants to generate music, video, images, and text using a machine, but the world doesn't want a pile of computer-generated Bach-like music. Completely artificial music may not be that interesting in itself, but it could be useful as an adaptive soundtrack, personalized in real time for your life. Generation alone is also not enough: it's important to have models which pay attention to surprise and attention. Surprise here means manipulating the listener/user in one way or another (expectancy and its violation), and with surprise comes some need to maintain attention. When creating computer-generated anything (including music and art), evaluation is important. It's also important to create tools and data which help others do the same, and to help people learn about these tools and use them for education.
The simplest description of deep learning is that it is a collection of algorithms based on multiple levels of learning and abstraction. Machine learning in general is focused on learning some transformation of data to achieve a task, rather than using a rule-based system. Deep learning tries to avoid hand-designing the data representation, instead making the entire system end-to-end learnable through progressive layers of nonlinearity. It has turned out to be compelling because it works well: it is state of the art in image recognition, speech recognition, and machine translation. In practice, hierarchical representations mean that features low in the hierarchy are generic, and they increase in level of abstraction until they are looking for specific structure. Rather than overfitting, the result can be used as a sort of semantic concept of the different classes, one which generalizes well.
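The "progressive layers of nonlinearity" idea can be sketched in a few lines. This is a minimal illustration, not code from the talk: a tiny stack of affine layers with ReLUs between them, where each layer's output is a new representation of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Pass x through each (W, b) layer with a ReLU nonlinearity,
    keeping every intermediate representation."""
    activations = [x]
    for W, b in layers:
        x = relu(x @ W + b)
        activations.append(x)
    return activations

# Three layers: 8 raw features -> 16 generic features -> 8 -> 4 abstract ones.
sizes = [8, 16, 8, 4]
layers = [(rng.standard_normal((m, n)) * 0.5, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

acts = forward(rng.standard_normal((2, 8)), layers)
print([a.shape for a in acts])   # one representation per level of the hierarchy
```

Nothing here is learned, of course; the point is only the shape of the computation, where lower layers feed progressively more abstract ones.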
DeepDream takes a network trained for something like image object recognition and optimizes the data input to the network, rather than the weights, in order to maximally activate an individual node in the network. Then, you can look at the resulting optimized image, and you start to see abstract class concepts in the image. If you start with random noise, you get “class textures”; if you start with a real image, objects in it get modified to look like class concepts. Given that the deep network works, it's not a surprise that this works.
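The mechanism can be demonstrated on a toy network (a hypothetical stand-in, not the real Inception model DeepDream uses): hold the weights fixed and run gradient *ascent* on the input to maximize one unit's activation.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((16, 8)) * 0.3   # fixed "trained" first-layer weights
w2 = np.abs(rng.standard_normal(8))       # weights into the target unit
                                          # (nonnegative, so the objective is
                                          # convex and ascent is monotone)

def activation(x):
    """The target unit's activation for input x."""
    return np.maximum(0.0, x @ W1) @ w2

def grad_wrt_input(x):
    """Backprop to the *input*, by hand: gate by the ReLU, then push
    the output weights back through W1."""
    mask = (x @ W1 > 0).astype(float)
    return (mask * w2) @ W1.T

x = rng.standard_normal(16) * 0.01        # start from near-noise
before = activation(x)
for _ in range(200):
    x += 0.1 * grad_wrt_input(x)          # ascend on the input, not the weights
after = activation(x)
print(before, after)                      # activation before vs. after ascent
```

On a real convolutional network the same loop, seeded with a photograph, is what produces the dream-like imagery.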
Another application trains a model which tries to decorrelate style from content (much in the way that MFCCs decorrelate timbre from the rest of a signal). You can then take a source picture and “restylize” it using another image, such as a painting. The result is a preservation of the content of the image, but with the “style” of the painter represented. It doesn't reimagine the geometry of the source according to the artist's techniques, but it does provide a sort of “artist filter”.
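The core quantity behind this style/content split (in the Gatys et al. formulation) is the Gram matrix of a layer's feature maps: it records which features co-occur, while discarding their spatial arrangement. The feature maps below are random stand-ins; the original work uses VGG activations.

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, positions) feature maps from one layer.
    Returns channel-by-channel correlations, the "style" statistic."""
    c, n = features.shape
    return features @ features.T / n

rng = np.random.default_rng(2)
feats = rng.standard_normal((4, 100))   # 4 fake channels, 100 spatial positions
G = gram_matrix(feats)

# Shuffling spatial positions leaves the Gram matrix (essentially) unchanged,
# which is exactly why it captures style but not content:
shuffled = feats[:, rng.permutation(100)]
print(np.allclose(G, gram_matrix(shuffled)))   # True
```

Style transfer then optimizes a new image to match the source image's content features and the painting's Gram matrices simultaneously, which matches the "artist filter" behaviour described above.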
Generative adversarial networks are unsupervised models which combine a generative model, which tries to output a certain type of data (e.g. images), with a discriminative model, which tries to decide whether an image is fake (generated by the other network) or a real-world example. The two models are trained together, so that, for example, the generative network “learns what it's doing wrong”. Success is reached when the discriminator can no longer tell the difference. Using convolutional networks, very realistic images can be generated. The generator may only be learning a tiny subspace of the space of possible images; it certainly can't learn to generate completely arbitrary images.
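The adversarial training loop can be shown in a deliberately tiny setting (an illustrative sketch, not from the talk): a linear generator tries to match data drawn from N(4, 1), while a logistic discriminator tries to tell real samples from generated ones. All gradients are written out by hand, and the hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

a, b = 1.0, 0.0        # generator: G(z) = a*z + b, with z ~ N(0, 1)
w, c = 0.0, 0.0        # discriminator: D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 64

for step in range(2000):
    x_real = rng.normal(4.0, 1.0, batch)
    x_fake = a * rng.standard_normal(batch) + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    gr = sigmoid(w * x_real + c) - 1.0       # d(-log D(real)) / d(logit)
    gf = sigmoid(w * x_fake + c)             # d(-log(1 - D(fake))) / d(logit)
    w -= lr * np.mean(gr * x_real + gf * x_fake)
    c -= lr * np.mean(gr + gf)

    # Generator step: make D call fakes real (non-saturating loss),
    # so the generator "learns what it's doing wrong" from D's gradient.
    z = rng.standard_normal(batch)
    x_fake = a * z + b
    gx = (sigmoid(w * x_fake + c) - 1.0) * w
    a -= lr * np.mean(gx * z)
    b -= lr * np.mean(gx)

print(f"generator offset b = {b:.2f} (data mean is 4.0)")
```

The generator's offset should drift toward the data mean as training alternates, which is the 1-D analogue of the image case; real GANs replace both linear maps with deep convolutional networks.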
Attention-based networks include a mechanism which learns to focus on a certain part of the input, resulting in a model which can learn to associate, for example, words with what each word is describing in an image. This can also be used to analyze errors, by looking at what the model was attending to on inputs for which it made a mistake.
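Soft attention of this kind boils down to a learned weighting over input positions. Here is a minimal sketch (illustrative names; the keys stand in for, say, image-region features): a query is scored against each position, the scores go through a softmax, and the weights are directly inspectable, which is what makes the error analysis above possible.

```python
import numpy as np

def soft_attention(query, keys, values):
    """query: (d,), keys/values: (n, d).
    Returns the attended context vector and the attention weights."""
    scores = keys @ query / np.sqrt(len(query))   # similarity per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over positions
    return weights @ values, weights

rng = np.random.default_rng(4)
keys = rng.standard_normal((5, 32))   # 5 input positions (e.g. image regions)
values = rng.standard_normal((5, 32))
query = keys[2]                       # a query aligned with position 2

context, weights = soft_attention(query, keys, values)
print(np.round(weights, 3))           # the weights reveal where the model "looked"
```

Plotting `weights` over the input is exactly the diagnostic used to see what a captioning model was attending to when it made a mistake.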
Finally, recurrent networks have been used to generate sequences, rather than fixed images. A recurrent network has a self-connection, so it can maintain some kind of state and store information over time. The problem is that when they are trained with gradient descent, the gradients of the network tend to either explode or vanish. The long short-term memory (LSTM) architecture can partially mitigate this. Such models can learn some structure in sequences, but in general may not capture the same level of structure as real data. They have been applied to generating music, Wikipedia articles, handwriting, etc.
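The explode-or-vanish problem can be seen numerically. Backpropagating through T steps of a plain RNN multiplies the gradient by the recurrent weight matrix T times, so its norm scales roughly like the matrix's spectral radius to the power T. The matrices below are illustrative (the nonlinearity's gating, which only shrinks gradients further, is ignored).

```python
import numpy as np

def gradient_norm_through_time(W, steps):
    """Norm of a backpropagated vector after `steps` multiplications
    by W.T, as in unrolled backprop through a plain RNN."""
    g = np.ones(W.shape[0])
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

rng = np.random.default_rng(5)
W = rng.standard_normal((32, 32)) / np.sqrt(32)   # spectral radius near 1

shrunk = gradient_norm_through_time(0.5 * W, 50)   # radius ~0.5: vanishes
grown  = gradient_norm_through_time(2.0 * W, 50)   # radius ~2.0: explodes
print(shrunk, grown)
```

The LSTM's gated additive cell state is designed precisely to sidestep this repeated multiplication, which is why it can hold information over many more steps.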
Given all of these models which can produce interesting output, some questions remain. First, why do this at all, given that humans can do it quite well? One practical reason is that much more personalized art and media can be created. It's also interesting as a way to study what art is, and what surprise and attention are and how they relate to art. One way to achieve surprise is to have multiple domains interplay, which could plausibly be modeled as similarity.