|Context||NIPS 2015 Deep Learning Symposium|
In loss minimization over training data with stochastic gradient descent, at each iteration we take a minibatch of examples, compute the gradient of the loss on those examples, and use it to take a gradient step. When training a model with sigmoidal nonlinearities, a problem arises when the input to the sigmoid falls into one of its saturated regions, because the gradients there become small. In general, then, we would like the sigmoid's input to stay in the unsaturated region, which in neural network models requires that the output of the previous layer be close to zero. Common ways to address this are careful initialization, small learning rates, and/or non-saturating nonlinearities such as ReLUs.
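To make the saturation problem concrete, here is a minimal sketch that evaluates the derivative of the logistic sigmoid at a few inputs. The helper names `sigmoid` and `sigmoid_grad` are illustrative, not taken from the talk.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks quickly as |x| grows,
# which is why saturated inputs yield very small gradients.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")
```

The derivative is largest at zero, which is why keeping the previous layer's output close to zero keeps the gradient signal usable.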
Covariate shift is a phenomenon where the training and test distributions differ, which hurts the model's generalization. A common remedy is to fine-tune the model on data from the correct distribution. The same phenomenon applies to sub-models, e.g. a subset of the parameters of a larger model whose output is the input to another part of the model. As such a model trains, the distribution of the inputs to the outer transformation keeps changing, so the outer transformation continually has to adjust to the inner transformation.
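As a toy illustration of this effect (a sketch under assumed values, not an experiment from the talk), consider a one-parameter-pair inner transformation `w * x + b`. The function `inner_layer` and the specific parameter values are hypothetical; the point is only that changing the inner parameters changes the distribution the outer transformation sees, even though the raw inputs are fixed.

```python
import statistics

def inner_layer(xs, w, b):
    """Hypothetical inner transformation: an affine map applied per example."""
    return [w * x + b for x in xs]

inputs = [-1.0, 0.0, 1.0, 2.0]  # fixed raw inputs

# Outputs of the inner transformation before and after an (assumed) parameter update.
before = inner_layer(inputs, w=1.0, b=0.0)
after = inner_layer(inputs, w=2.0, b=1.0)

# The outer transformation's input distribution shifts in both mean and spread.
for name, acts in (("before update", before), ("after update", after)):
    print(name, "mean =", statistics.mean(acts), "std =", statistics.pstdev(acts))
```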
The proposed method is a simple way to reduce this shift: normalize each activation to have zero mean and unit variance. A natural way to normalize is to compute a running average over examples as they are fed into the network, but this doesn't work in practice because the model can blow up: the gradient step ignores the normalization, so parameters can keep growing without changing the loss. Instead, the gradient has to account for how the normalization statistics depend on the parameters, so the method modifies the model architecture rather than the optimizer. The augmentation takes advantage of the fact that examples come in minibatches: it computes the minibatch mean and variance and uses them to normalize the activations. An additional scale and shift are learned for each activation, which lets the model keep the same capacity as the original model without normalization. Given the gradients of the loss with respect to the scaled, shifted, and normalized values, backpropagation can still be carried out across the whole model.
When doing inference, the minibatch statistics are replaced by the population-level mean and variance, computed by maintaining running averages. Batch normalization is applied per node across the examples in the minibatch, after the linear transform and before the nonlinearity, which normalizes out the scale of the weights so that their initial standard deviation matters less.
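The transform described above can be sketched for a single activation as follows. This is a minimal illustration, not the authors' implementation: the function names `batch_norm_train` and `batch_norm_infer`, the `momentum` parameter for the running averages, and the value of `EPS` are all assumptions made for the example.

```python
import math

EPS = 1e-5  # small constant for numerical stability (assumed value)

def batch_norm_train(xs, gamma, beta, running, momentum=0.9):
    """Normalize one activation over a minibatch, then scale and shift.

    `running` holds the running mean/variance used later at inference;
    the exponential-average update with `momentum` is an assumption.
    """
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m  # minibatch variance
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return [gamma * (x - mean) / math.sqrt(var + EPS) + beta for x in xs]

def batch_norm_infer(x, gamma, beta, running):
    """At inference, use the running statistics instead of minibatch ones."""
    return gamma * (x - running["mean"]) / math.sqrt(running["var"] + EPS) + beta

running = {"mean": 0.0, "var": 1.0}
batch = [2.0, 4.0, 6.0, 8.0]  # pre-activations of one node over a minibatch
out = batch_norm_train(batch, gamma=1.0, beta=0.0, running=running)

# Scaling the pre-activations (e.g. by scaling the weights) by 10 leaves the
# normalized output essentially unchanged, up to the effect of EPS.
scaled_out = batch_norm_train([10.0 * x for x in batch], 1.0, 0.0,
                              {"mean": 0.0, "var": 1.0})
print(out)
print(scaled_out)
```

With `gamma=1.0` and `beta=0.0` the output has approximately zero mean and unit variance. Learning `gamma` and `beta` lets the network undo the normalization where that is useful, which is how the original capacity is preserved.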
In a simple 3-layer fully-connected neural network trained on the MNIST task, the distribution of the inputs to the sigmoid changes over time, but when batch normalization is applied it remains much more stable. As a result, the batch-normalized model both trains faster and achieves higher accuracy.
Batch normalization was also tried in a larger-scale experiment using the Inception network for ImageNet classification, where it adds a small amount of computation. In this experiment, the learning rate could be increased by a factor of 30, and the batch-normalized network reached a higher accuracy than the non-normalized one much faster. It was also possible to train a batch-normalized network with logistic units; it was not as good as the original network (with ReLUs), but it was trainable, which was not possible without normalization. With some additional tricks (ensembling), state-of-the-art ImageNet results are possible.