neural_network_hyperparameters

Most machine learning algorithms involve “hyperparameters” which are variables set before actually optimizing the model's parameters. Setting the values of hyperparameters can be seen as model selection, i.e. choosing which model to use from the hypothesized set of possible models. Hyperparameters are often set by hand, selected by some search algorithm, or optimized by some “hyper-learner”.

Neural networks can have many hyperparameters, including those which specify the structure of the network itself and those which determine how the network is trained. This document describes the hyperparameters typically encountered when training neural networks and covers some common techniques for setting them, following the discussion in Section 3 of
(Bengio 2012)^{1)}. In particular, we will focus on feed-forward neural nets trained with mini-batch gradient descent. When appropriate, recommendations for hyperparameter choices are given, which should be taken with many grains of salt.

When training a neural network, the resulting model will depend not only on the chosen structure but also on the training method used to set the network's parameters. The training method itself can have many hyperparameters. Here, we describe the hyperparameters of mini-batch gradient descent, which updates the network's parameters using gradient descent on a subset of the training data (which is periodically shuffled, or assumed infinite). We'll define the $t^{th} < T$ mini-batch gradient descent update of network parameters $\theta$ as

$$ \theta^{(t)} \leftarrow \theta^{(t - 1)} - \epsilon_t \frac{1}{B} \sum_{t^\prime = Bt + 1}^{B(t + 1)} \frac{\partial L(z_{t^\prime}, \theta)}{\partial\theta} $$

where $z_{t^\prime}$ is example $t^\prime$ in the training set and the hyperparameters are the loss function $L$, the learning rate at step $t$ $\epsilon_t$, the mini-batch size $B$, and the number of iterations $T$. Note that we must define $\theta^{(0)}$, which is also a hyperparameter.

The learning rate $\epsilon_t$ determines how “quickly” the gradient updates follow the gradient direction. If the learning rate is very small, the model will converge too slowly; if the learning rate is too large, the model will diverge. When $\epsilon_t$ varies over iterations, we must define an initial learning rate $\epsilon_0$ and a way to compute $\epsilon_t$. For neural nets with standardized inputs or inputs bounded on $[0, 1]$, $\epsilon_0$ is typically set between $1$ and $10^{-6}$ and a good value to try is $.01$. However, due to $\epsilon_0$'s importance it is critical that it is tuned.

The learning rate $\epsilon_t$ is typically decreased over time; the heuristic being that at first a decent parameter setting should be found, after which it should be fine-tuned. One approach is to set $\epsilon_t = \epsilon_0$ when $t < \tau$ (now $\tau$ is a new hyperparameter) after which point we set $\epsilon_t = \epsilon_0/t^\alpha$ ($\alpha$ is another hyperparameter). “Smaller” values of $\alpha$ should be used when the objective is very non-convex and some kind of gradient averaging (e.g. momentum) is used. It's also common to set $\tau$ adaptively to the iteration at which the training criterion stops decreasing significantly.

The loss function compares the network's output for a training example against the intended ground truth output. A common general-purpose loss function is the squared Euclidian distance, given by

$$ L = \frac{1}{2} \sum_i (y_i - z_i)^2 $$

where $y_i$ is the output of the $i^{th}$ network output unit, and $z_i$ is the $i^{th}$ value of the target output. The $1/2$ is included by convention to make its gradient more simple. When the output of the neural network is being treated as a probability distribution (e.g. a softmax output layer is being used), it's common and more theoretically valid to use the cross-entropy loss, defined by

$$ L = -\sum_i y_i \log(z_i) $$

Theoreticlaly, the choice of $B$ is mostly computational - when $B$ is larger, updates can be computed more efficiently due to the use of parallel architectures; when $B$ is smaller, more updates can be made. Due to the fact that it should not effect generalization/test performance, $B$ can be optimized separately from other hyperparameters (e.g. can be set first and then fixed while the other hyperparameters, aside from momentum, are optimized).

The most common way to set $T$ is using the principle of early stopping. Early stopping simply stops training once performance on a held-out validation set stops increasing (i.e. the cost begins increasing steadily instead of decreasing). This can be a powerful way to prevent overfitting, to the extent that it may make the choice of other hyperparameters less important. The validation performance should be evaluated infrequently; for example every time the algorithm has seen several times more new examples than there are in the validation set.

A very common technique is to “smooth” the gradient updates using a leaky integrator filter with parameter $\beta$ by

$$ \bar{g} \leftarrow (1 - \beta)\bar{g} + \beta \frac{\partial L(z_t, \theta)}{\partial \theta} $$

$\bar{g}$ can then be used in place of the “true” gradient update in gradient descent. Some mathematically motivated approaches can ensure much faster convergence when using appropriate momentum, however for pure stochastic gradient descent standard gradient updates ($\beta = 1$) with a harmonically decreasing learning rate is optimal.

The structure of the neural network itself involves numerous hyperparameters in its design, including the size and nonlinearity of each layer. The numeric properties of the weights are often also constrained in some way, and their initialization can have a strong effect on model performance. Finally, preprocessing of the input data can also be important for ensuring convergence. As a practical note, many hyperparameters can vary across layers.

Large hidden layers can allow the neural network to fit the training data arbitrarily well, but because regularization is typically used, it’s mostly important to just use large hidden layers. Using the same size for all hidden layers generally works better or the same as using a decreasing or increasing size. In addition, using a first hidden layer which is larger than the input layer tends to work better. When using unsupervised pre-training, the layers should be made much bigger than when doing purely supervised optimization.

To reduce overfitting, a regularization on the network weights is sometimes added to the training criterion (loss function). When encouraging the network weights $\theta$ to be close to zero (the typical case), L2 regularization adds $\lambda_2 \sum_i \theta_i^2$ while L1 adds $\lambda_1 \sum_i |\theta_i|$ where $\lambda_2$ and $\lambda_1$ are lagrange multipliers which determine how “important” this regularization should be considered to be.

This regularization can be viewed as a negative log-prior on the parameters. In this interpretation, in the mini-batch case, the gradient of the regularization penalty should be multiplied by $B/T$ where $T$ is the training set size. In the online setting, $B/t$ should be used. L2 regularization corresponds to a Gaussian prior with variance $\sigma^2 = 1/(2\lambda_2)$, penalizing large weight values. L1 regularization corresponds to a Laplace density prior with a scale parameter $s = 1/\lambda_1$, acting as a form of feature selection. It is common to constrain only the input and/or output weights.

It may be advantageous for the hidden unit activations to be sparse. One way to enforce this is to use an L1 penalty (as discussed above) on the hidden unit parameters, provided that the activation has a saturating output around $0$ (such as sigmoid, and not tanh). Other common sparsity-inducing regularizers include the Student-t $\log(1 + h_j^2)$, mean-squared $\sum_j (\rho - \bar{h}_j)^2$ and KL-divergence $-\rho \log(\bar{h}_j) - (1 - \rho)\log(1 - \bar{h}_j) + C$ where $h_j$ is the $j^{th}$ activation of hidden layer $h$, $\bar{h}$ is the average activation over, e.g., a mini-batch, and $\rho$ and $C$ are additional hyperparameters.

Commonly used nonlinearities include: The sigmoid $1/(1 + e^{-a})$ which smoothly “squashes” its output into the range $[0, 1]$, tanh $\frac{e^a - e^{-a}}{e^a + e^{-a}}$ which is equivalent (up to scaling) to sigmoid with range $[-1, 1]$, the rectifier (ReLU) $\mathrm{max}(0, a)$, and the “hard tanh” or step function. Sigmoid activation on the output layer can cause issues with deep networks trained in a supervised fashion. Similarly, “hard” units (ReLU and step) don’t make sense for the output layer, because when the output is saturated, no error gradient is passed back into the network.

Biases are typically initialized to $0$, but weights must be initialized carefully; their initialization can have a big impact on the local minimum found by the training algorithm. When initialized randomly, weights are often initialized in a range of $[-r, r]$ where $r$ is the inverse of the square root of the fan-in (sometimes added to the fan-out) of the unit. For RBMs, a zero-mean Gaussian with a small standard deviation ($0.1$ or $0.01$) should be used. Unsupervised pre-training can be seen as essentially a sophisticated initialization technique, which is considered in most settings “to help and very rarely to hurt”. Common unsupervised initialization techniques include greedily stacked RBMs (DBM) or denoising/contractive auto encoders.

Many of the processes involved in training a neural network involve using a random number generator (e.g. random sampling of training data, weight initialization, etc). As a result, the seed passed to the random number generator can have a slight effect on the results. However, a different random seed can produce a non-trivially different model (even if it performs about as well). As a result, it’s common to train multiple models with multiple random seeds and use model averaging (bagging, Bayesian methods) to improve performance.

The statistics of the input data can have a strong effect on network performance. Element-wise standardization (subtract the mean and divide by the standard deviation), Principal Component Analysis, uniformization (transform each feature value to its approximate normalized rank or quantile), and nonlinearities such as the logarithm or square root are common.

The number of hyperparameters delineated above indicate that there are a substantial number of choices to be made when creating a neural network learner, and that these choices will affect the success and failure of the model. In order to ensure reproducibility of results, a principled approach should be used for setting hyperparameters, or, at the very least, they should be explicitly stated in the model description. If a human is involved in hyperparameter search and the results are not stated explicitly, the results are not reproducible.

Hyperparameter selection can be seen as both an optimization problem (which is not necessarily convex in any single variable) and a generalization problem (because overfitting is still possible). It is made especially difficult due to the computational cost; each hyperparameter setting should be used to train a new model, and model training is typically expensive. However, it has been shown that for some hyperparameters, the best setting can be obtained from a cheaper estimator (e.g. randomly setting weights).

In most cases, a range of values is tried for each hyperparameter. It’s always possible that the best value falls on the edge of this range, which my indicate that a better value lies outside of the range. Due to non convexity, however, a “best” value on the interior of the search interval still doesn’t ensure that a better value doesn’t fall outside the interval. The “scale” of the interval also needs to be chosen (e.g. how the values are sampled). It often makes most sense to set the interval as a logarithmic range because the ratio between different values is often a better guide of the expected impact of the change. Once a good solution is found, we can adjust the scale and range that we’re searching over in order to fine-tune our best hyperparameter setting.

We can apply the idea of coordinate descent to hyperparameter optimization - keep all hyperparameters fixed except for one, and adjust that hyperparameter to minimize the validation error.

Grid search simply tries every hyperparameter setting over a specified range of values. This involves a cross-product of all intervals, so the computational expense is exponential in the number of parameters. Fortunately, it can be easily parallelized, but care should be taken to ensure that if one job fails it fails gracefully; otherwise a portion of the hyperparameter space could be left unexplored. Typically, a user-driven refinement approach is also used where a large-scale coarse search is first carried out, followed by successively more fine-tuned and fine-grained searches.

A straightforward alternative to grid search is to sample the hyperparameter space randomly. This is similarly trivially parallelizable and can work much better than grid search in practice because grid search can take an exponentially long time to reach a good hyperparameter subspace, and because only a few hyperparameters tend to matter a lot. We can also readily introduce hyperparameter distributions (continuous variables are typically uniform in the log domain, inside the interval of interest; discrete parameters are typically multinomially distributed) and encode conditional dependence between hyperparameters. The search can be terminated once the validation error plateaus.

Recently, various sequential model-based optimization methods have been proposed to search the hyperparameter space in a principled way. These techniques include modeling the generalization performance as a sample from a Gaussian process and as a graph-structured generative process using a tree-structured Parzen estimator. These approaches are implemented in the Hyperopt ^{2)}, Spearmint ^{3)} and SMAC ^{4)} packages.

neural_network_hyperparameters.txt · Last modified: 2015/12/17 21:59 (external edit)