conditional_modeling_for_fun_and_profit

| Presenter | Kyle Kastner |
| Context | Columbia Neural Network Reading Group/Seminar Series |
| Date | 7/28/15 |

Deep learning attempts to learn features from data, building a hierarchy of features that goes from very low-level to high-level. The hierarchy is built from non-linear functions; a common choice is $f(Wx + b)$, where $f$ is some nonlinearity, $W$ is a weight matrix, $x$ is the input, and $b$ is a bias. The parameters are typically optimized with a stochastic optimization technique (e.g. stochastic gradient descent) to minimize some objective over the data.
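A minimal sketch of one such layer $f(Wx + b)$, with $f = \tanh$; all sizes and the initialization scheme are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid = 4, 3
W = rng.randn(n_hid, n_in) * 0.1  # weight matrix
b = np.zeros(n_hid)               # bias
x = rng.randn(n_in)               # input

h = np.tanh(W @ x + b)            # nonlinearity f = tanh
```

Stacking several such layers, each taking the previous layer's output as input, yields the feature hierarchy described above.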

Sufficient statistics are the parameters required to fully describe a given probability distribution. It turns out you can interpret the outputs of a layer as sufficient statistics of a certain probability distribution (“Mixture Density Networks”, Bishop). Given this, we can use certain kinds of nonlinearities and layer structures to parameterize particular distributions (e.g. sigmoid for Bernoulli; softmax for multinomial; two linear outputs for a Gaussian's mean and log variance; softmax plus two linear outputs for a GMM, etc.). We can combine this idea with recurrence to learn a dynamic distribution over sequences - you are learning the dynamics of sequential data and getting probability densities out.
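A hedged sketch of the Gaussian case: two linear heads whose outputs are read as the mean and log variance of a diagonal Gaussian (the log parameterization keeps the variance positive). Shapes and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_out = 5, 2
h = rng.randn(n_in)  # stand-in for the last hidden layer's output

W_mu, b_mu = rng.randn(n_out, n_in) * 0.1, np.zeros(n_out)
W_lv, b_lv = rng.randn(n_out, n_in) * 0.1, np.zeros(n_out)

mu = W_mu @ h + b_mu        # linear head -> mean
log_var = W_lv @ h + b_lv   # linear head -> log variance
sigma = np.exp(0.5 * log_var)  # exp guarantees sigma > 0

sample = mu + sigma * rng.randn(n_out)  # draw from the parameterized Gaussian
```

The same pattern extends to the GMM case by adding a softmax head for the mixture weights.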

Autoencoders are networks trained to reconstruct their input, after some kind of compression to a small representation or under some regularization. They are not intended as generative models; an alternative has been proposed (Kingma and Welling, “Auto-Encoding Variational Bayes”) which links autoencoders with variational Bayes. It has been shown that given any function mapping the input data to the latent space, there is a variational lower bound on the log-likelihood, which includes a KL divergence term. The latent space encodes some information about the input data, and we'd like a way to map the input data $x$ to a latent representation $z$. When doing MCMC or EM, sampling $z$ blocks the gradient, but via a reparameterization we can maximize the lower bound on the reconstruction using backpropagation. In practice, the regularization penalty also tends to make $z$ sparse. To include classification, the class label $y$ is used as an additional input; this gives a way to feed controls into the model as a prior and allows the model to generate only plausible samples for a given class. For feedforward networks, this can work by simply concatenating the class label as features with the input.
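A sketch of the reparameterization trick: instead of sampling $z \sim N(\mu, \sigma^2)$ directly (which blocks the gradient), sample $\epsilon \sim N(0, 1)$ and compute $z = \mu + \sigma \epsilon$, a deterministic function of the parameters through which backpropagation can flow. The particular values below are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
mu = np.array([0.5, -1.0])       # encoder's predicted mean
log_var = np.array([0.0, -2.0])  # encoder's predicted log variance
sigma = np.exp(0.5 * log_var)

eps = rng.randn(2)        # noise, independent of the parameters
z = mu + sigma * eps      # differentiable w.r.t. mu and log_var

# The KL(N(mu, sigma^2) || N(0, 1)) term of the lower bound,
# summed over dimensions (closed form for diagonal Gaussians):
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

The reconstruction term of the lower bound would then be computed from a decoder applied to `z`, and both terms optimized jointly by backpropagation.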

In convolutional and recurrent networks, we make the assumption that parameters can be shared across certain dimensions. For recurrence, we assume our data occurs sequentially. Recurrent networks are similar to Kalman filters, except that they essentially learn the state-space model. As with feedforward networks, recurrent networks can be used to parameterize probability distributions. In “Generating Sequences with Recurrent Neural Networks”, an RNN is used to parameterize GMMs for handwriting and speech generation, which allows the network to switch dynamics over time. By adding input attention and additional inputs, the network can generate samples from a specific class. This can also be achieved using an RNN language model, an RNN-RBM, or an RNN-NADE. The RNN won't preserve all dynamics correctly - e.g. prosody and style. The RNN itself is also deterministic; the sampling only happens at the output. The input representation may also need to be heavily engineered.
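A minimal vanilla-RNN step, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, showing the parameter sharing across time; an output head (e.g. the GMM heads sketched earlier) could then emit distribution parameters at every step. Sizes and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid = 3, 4
W_x = rng.randn(n_hid, n_in) * 0.1   # input-to-hidden weights
W_h = rng.randn(n_hid, n_hid) * 0.1  # hidden-to-hidden weights
b = np.zeros(n_hid)

h = np.zeros(n_hid)        # initial hidden state
xs = rng.randn(5, n_in)    # a length-5 input sequence
for x_t in xs:
    # the same W_x, W_h, b are reused at every time step
    h = np.tanh(W_x @ x_t + W_h @ h + b)
```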

One way to condition the RNN is to alter its initial hidden state. It can also be conditioned on the inputs it has seen. Combining both of these ideas results in the encoder-decoder structure, where an encoder RNN is fed the input one step at a time, and its final hidden state is used as the initial state of a decoder RNN that generates the output sequence. This relies on compressing all of the information in the sequence into a fixed-length vector. An alternative is to use “attention”: use a bidirectional network, so that each hidden state summarizes the states before and after it, and then an attention mechanism which learns to weight the different hidden states. This model can be extended by making the transformation to the hidden state deep and by using a raw input representation.
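A hedged sketch of attention over a sequence of hidden states: a score per state, softmaxed into weights, then a weighted sum (the context vector). The dot-product scoring function below is an illustrative choice, not the exact learned scoring network from the paper.

```python
import numpy as np

rng = np.random.RandomState(0)
T, n_hid = 6, 4
H = rng.randn(T, n_hid)   # one (bidirectional) hidden state per input position
s = rng.randn(n_hid)      # current decoder state (assumed query vector)

scores = H @ s            # one scalar score per position
scores -= scores.max()    # shift for numerical stability
alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
context = alpha @ H       # weighted combination of hidden states
```

The decoder then conditions on `context` at each step, so it no longer has to squeeze the whole sequence into one fixed-length vector.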

A few extensions have been proposed to add a distributed latent representation to RNNs (VRAE, STORN, DRAW), but they all assume the latent variables $z_t$ are independent (or only loosely dependent) over time. In the proposed variational RNN (VRNN), the prior is recurrent, i.e. $z_{< t}$ affects $z_t$ and $h_t$. The VRNN is essentially a VAE at every time-step. The prior is therefore learned, instead of being fixed to $N(0, 1)$ as in the VAE. This enforces consistency over time. The VRNN therefore includes a prior, the ability to generate, a state that changes over time (and is affected by the prior), and the ability to perform inference. In experiments on natural-sound synthesis, a structured $z$ seems to help - it keeps style consistent, and can help predict correlated data.
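The recurrent-prior idea can be sketched as follows: the parameters of the prior over $z_t$ are computed from the previous hidden state $h_{t-1}$ (which carries information about $z_{< t}$ and $x_{< t}$), rather than being fixed to $N(0, 1)$. All weights, sizes, and the linear form of the prior network are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_hid, n_z = 4, 2
W_mu = rng.randn(n_z, n_hid) * 0.1  # prior-mean network (illustrative)
W_lv = rng.randn(n_z, n_hid) * 0.1  # prior-log-variance network (illustrative)

h_prev = np.tanh(rng.randn(n_hid))  # recurrent state summarizing z_{<t}, x_{<t}
prior_mu = W_mu @ h_prev                      # learned, time-varying prior mean
prior_sigma = np.exp(0.5 * (W_lv @ h_prev))   # learned prior std, always > 0

z_t = prior_mu + prior_sigma * rng.randn(n_z)  # reparameterized sample of z_t
```

Because the prior moves with the hidden state, consecutive $z_t$ stay consistent over time, which is the property credited above with keeping style stable in generation.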
