maxout_networks

Authors | Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio |

Publication info | Retrieved here |

Retrieval date | 9/22/14 |

Presents a new network architecture/nonlinearity which exploits the model averaging properties of dropout.

Dropout is a technique for regularizing neural networks in which, during each training iteration, a random subset of the network's units is “dropped” (that is, temporarily removed along with their input and output connections). At test time, the full network is used with its weights scaled down. The intention is a form of efficient model averaging, where each model is a “thinned” neural network containing a subset of the units of the original network. In this view, a network with $n$ units can be seen as a collection of $2^n$ possible “thinned” networks that share weights, and dropout trains only a small subset of these. Because it is not feasible to explicitly average the predictions of all of the thinned networks, a simple approximation is used where every weight in the network is scaled from $w$ to $pw$, where $p$ is the probability of retaining a unit during training. It turns out that for a single layer neural network with a softmax activation (logistic regression) and $p = 0.5$, the renormalized geometric mean of the predictions of the exponentially many sub-models can be computed exactly by running the full model with all weights divided by $2$. This also holds true for a neural network with all linear layers. For more general networks, this weight scaling is only an approximation (Wang and Manning, 2013).
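The softmax case can be checked numerically on a tiny model. The sketch below is illustrative only: it assumes dropout is applied to the inputs of a softmax regression with a retention probability of $0.5$, enumerates every mask, and compares the renormalized geometric mean of the sub-model predictions with the prediction of the full model with halved weights. The sizes, the random weights and the helper names are all made up for the example.

```python
import itertools
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax for a 1-D vector of logits.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

rng = np.random.default_rng(0)
n_inputs, n_classes = 6, 3          # small enough to enumerate all 2^6 masks
W = rng.normal(size=(n_inputs, n_classes))
b = rng.normal(size=n_classes)
x = rng.normal(size=n_inputs)

# Log-prediction of every thinned model (one per binary mask on the inputs).
log_probs = [
    log_softmax((np.array(mask) * x) @ W + b)
    for mask in itertools.product([0.0, 1.0], repeat=n_inputs)
]

# Geometric mean over all masks, then renormalize so it sums to one.
geo_mean = np.exp(np.mean(log_probs, axis=0))
geo_mean = geo_mean / geo_mean.sum()

# Weight-scaling rule: run the full model once with W divided by 2.
scaled = np.exp(log_softmax(x @ (W / 2) + b))

print(np.allclose(geo_mean, scaled))
```

The two distributions agree because the logits are linear in the mask, so averaging log-probabilities over masks is the same as averaging the masks themselves; with a nonlinearity between the mask and the softmax this equality no longer holds in general.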

In each layer of a maxout network, instead of computing a single linear mix of the input, $k$ linear mixes are computed. In other words, instead of a mixing matrix and bias vector, a three dimensional mixing tensor $W \in \mathbb{R}^{d \times m \times k}$ and a bias matrix $b \in \mathbb{R}^{m \times k}$ are used. So, for each layer with input $x \in \mathbb{R}^d$, a linear mix $z \in \mathbb{R}^{m \times k}$ is computed by $z_{ij} = x^\top W_{:ij} + b_{ij}$ (note that a typical layer computes $z \in \mathbb{R}^m$ by $z_i = x^\top W_{:i} + b_i$). Then, the layer output $h(x) \in \mathbb{R}^m$ is computed by $h_i(x) = \max_{j \in [1, k]} z_{ij}$. In other words, for each output dimension $i$, the largest of the $k$ values $z_{i1}, \ldots, z_{ik}$ is chosen. A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function, so that the activation function is effectively learned. Maxout units tend to yield non-sparse activations and are almost never bounded from above.
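The layer computation above can be sketched in a few lines of NumPy. This is a minimal illustration for a single input vector, not the authors' implementation; the function name `maxout_layer` and the sizes are chosen for the example.

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout layer for one input vector.

    x: (d,) input, W: (d, m, k) mixing tensor, b: (m, k) bias matrix.
    Returns h of shape (m,), where h[i] = max_j (x @ W[:, i, j] + b[i, j]).
    """
    z = np.einsum("d,dmk->mk", x, W) + b   # the k linear mixes per output unit
    return z.max(axis=1)                   # keep the largest mix for each unit

rng = np.random.default_rng(0)
d, m, k = 4, 3, 5
x = rng.normal(size=d)
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))
h = maxout_layer(x, W, b)

# Special case: with k = 2 and one piece fixed at zero, maxout reduces to a ReLU.
W_relu = np.zeros((d, m, 2))
W_relu[:, :, 0] = W[:, :, 0]
b_relu = np.zeros((m, 2))
b_relu[:, 0] = b[:, 0]
h_relu = maxout_layer(x, W_relu, b_relu)
```

The ReLU special case shows one sense in which the activation is learned: a rectifier is just one of the shapes a maxout unit with $k = 2$ can take.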

That a maxout network can approximate any continuous function can be seen by first observing that any continuous piecewise linear function can be expressed as a difference of two convex piecewise linear functions. As mentioned above, a single maxout unit with enough linear pieces can represent any convex piecewise linear function, so the difference of two maxout units (which can be implemented by a linear layer which has weights $1$ and $-1$ to each maxout unit) can represent any continuous piecewise linear function. Finally, the Stone-Weierstrass approximation theorem implies that any continuous function over some compact domain can be approximated arbitrarily well by a continuous piecewise linear function. So, a maxout network can approximate any continuous function on a compact domain.
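As a concrete instance of the difference-of-convex-functions step, the non-convex function $f(x) = \min(|x|, 1)$ can be written as $g(x) - h(x)$ with $g(x) = \max(x, -x)$ and $h(x) = \max(x - 1, -x - 1, 0)$, both of which are single maxout units. The sketch below checks this numerically; the target function and the helper name `maxout_unit` are chosen purely for illustration.

```python
import numpy as np

def maxout_unit(x, w, b):
    """Scalar-input maxout unit, max_j (w[j] * x + b[j]), for each entry of x."""
    return np.max(np.outer(x, w) + b, axis=1)

# g(x) = |x| = max(x, -x); the third piece repeats the first so both units use k = 3.
g_w, g_b = np.array([1.0, -1.0, 1.0]), np.array([0.0, 0.0, 0.0])
# h(x) = max(x - 1, -x - 1, 0), which is zero on [-1, 1] and |x| - 1 outside it.
h_w, h_b = np.array([1.0, -1.0, 0.0]), np.array([-1.0, -1.0, 0.0])

x = np.linspace(-3.0, 3.0, 601)
# Output layer with weights +1 and -1 on the two maxout units.
difference = maxout_unit(x, g_w, g_b) - maxout_unit(x, h_w, h_b)
target = np.minimum(np.abs(x), 1.0)

print(np.allclose(difference, target))
```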

Maxout networks get state-of-the-art or near state-of-the-art performance on most of the image classification benchmarks tested. On the CIFAR datasets, using the same preprocessing, a ReLU network needed roughly $k$ times as many parameters for its performance to approach that of a maxout network. Otherwise, the maxout network substantially outperformed the ReLU network whether or not channel pooling was used.

As discussed above, dropout does exact model averaging for a neural network with multiple linear layers. This suggests that dropout will also do exact model averaging for any activation function, provided that it is locally linear over the space of inputs to each layer that are visited by applying different dropout masks. For maxout units trained with dropout, the identity of the maximal linear mix may change relatively rarely as the dropout mask changes, so that in effect the network acts like a linear one, making the approximate model averaging more exact (in contrast to a network with nonlinear units). To test this assertion, a tanh network and a maxout network were trained with dropout on MNIST, and the test error was computed both for the approximate model average (weight scaling) and for a geometric mean over a sample of dropout masks. It was found that the KL divergence between the approximate model average and the sampled model average was smaller for maxout, indicating that maxout was doing more effective model averaging.
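The comparison described above can be sketched as follows. This is a toy illustration of the measurement procedure only, using an untrained one-hidden-layer tanh network with random weights rather than the MNIST models from the paper. Dropout on both the inputs and the hidden units with retention probability $0.5$, the direction of the KL divergence, and all names and sizes are assumptions made for the example.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def log_predict(x, params, in_mask, hid_mask):
    """Log class probabilities of a one-hidden-layer tanh network under given masks."""
    W1, b1, W2, b2 = params
    hidden = np.tanh((x * in_mask) @ W1 + b1) * hid_mask
    return log_softmax(hidden @ W2 + b2)

def sampled_geometric_mean(x, params, n_samples, rng):
    """Renormalized geometric mean of predictions over randomly sampled dropout masks."""
    W1, _, W2, _ = params
    total = np.zeros(W2.shape[1])
    for _ in range(n_samples):
        in_mask = rng.integers(0, 2, size=W1.shape[0])    # keep each input w.p. 0.5
        hid_mask = rng.integers(0, 2, size=W1.shape[1])   # keep each hidden unit w.p. 0.5
        total += log_predict(x, params, in_mask, hid_mask)
    p = np.exp(total / n_samples)
    return p / p.sum()

def kl_divergence(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 16, 4
params = (rng.normal(scale=0.5, size=(n_in, n_hid)), np.zeros(n_hid),
          rng.normal(scale=0.5, size=(n_hid, n_out)), np.zeros(n_out))
x = rng.normal(size=n_in)

sampled = sampled_geometric_mean(x, params, n_samples=2000, rng=rng)
# Weight-scaling approximation: replace every mask by its expected value 0.5.
approx = np.exp(log_predict(x, params, 0.5, 0.5))
kl = kl_divergence(sampled, approx)
print(kl)
```

Repeating the same measurement with the tanh hidden layer swapped for a maxout layer, on trained networks and averaged over test examples, would give the kind of comparison reported in the notes.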

An additional reason maxout is effective (in particular compared to rectifier units) is that it is easier to train with dropout than a max-pooled ReLU network, particularly with many layers. This is empirically verified by observing that for a deep max-pooled ReLU network with narrow layers, the train and test error increase dramatically once the network gets above 5 or so layers. Furthermore, empirically, when trained with dropout, ReLU units transition from positive to 0 activation more often than they do from 0 to positive, resulting in more 0 activations. In contrast, with SGD without dropout, rectifier units rarely saturate at 0. When a ReLU unit is saturated at 0, the gradient gets blocked, which partially prevents the network from being adjusted at each iteration. Maxout doesn't have this issue, because the gradient always flows through the maximal linear mix. In practice, if a rectifier is used instead of max in a maxout network, a large proportion of the filters never actually get used. Also in practice, the variance of the gradient for lower layers in a maxout network was higher than with rectifier units, suggesting that maxout may better propagate varying information down the network.
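The gradient-blocking argument can be made concrete with hand-written gradients for a single unit. The sketch below is illustrative only, and the weights, input and function names are invented for the example: a ReLU unit whose pre-activation is negative passes no gradient to its weights, while a maxout unit always passes the gradient to the weights of whichever linear piece is currently maximal.

```python
import numpy as np

def relu_weight_grad(w, x):
    """Gradient of max(0, w @ x) with respect to w: zero when the unit is inactive."""
    return x.copy() if w @ x > 0 else np.zeros_like(x)

def maxout_weight_grad(W, b, x):
    """Gradient of max_j (x @ W[:, j] + b[j]) with respect to W.

    Only the column of the winning piece receives gradient, and that gradient is x.
    """
    grad = np.zeros_like(W)
    winner = np.argmax(x @ W + b)
    grad[:, winner] = x
    return grad

x = np.array([1.0, -2.0])

# ReLU unit with pre-activation 1*1 + 1*(-2) = -1, so it is saturated at 0.
w = np.array([1.0, 1.0])
relu_grad = relu_weight_grad(w, x)

# Maxout unit with two pieces whose pre-activations are -1 and -2.
W = np.array([[1.0, -1.0],
              [1.0,  0.5]])
b = np.zeros(2)
maxout_grad = maxout_weight_grad(W, b, x)

print(relu_grad)     # all zeros: nothing is updated for this example
print(maxout_grad)   # the first column equals x, even though every piece is negative
```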

maxout_networks.txt · Last modified: 2015/12/17 14:59 (external edit)