====== Stochastic Optimization Techniques ======
    
Neural networks are often trained stochastically, i.e., using a method where the objective function changes at each iteration. This stochastic variation is due to the model being trained on different data during each iteration. This is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum. Furthermore, training neural networks is usually done using only the first-order gradient of the parameters with respect to the loss function. This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix. Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix. This document summarizes some of the more popular methods proposed recently; for a similar overview, see ((Ruder, "An overview of gradient descent optimization algorithms", http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms)) or the documentation of Climin ((https://climin.readthedocs.org/en/latest/#optimizer-overview)), or see ((Schaul, Antonoglou and Silver, "Unit Tests for Stochastic Optimization")) for a comparison of some of these methods on simple tasks.
  
In the following, we will use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}$ which is to be minimized.
  
===== Stochastic Gradient Descent =====
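
The basic stochastic gradient descent update steps against the gradient of the loss computed on the current batch of data: $\theta_{t + 1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$, where $\eta$ is a learning rate hyperparameter. A minimal NumPy sketch of this update (the toy noisy quadratic gradient and the value of $\eta$ are illustrative assumptions):

<code python>
import numpy as np

def sgd(grad, theta0, eta=0.01, n_steps=100):
    """Vanilla SGD: step against the (stochastic) gradient of the loss."""
    theta = theta0.copy()
    for t in range(n_steps):
        theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta*dL/dtheta
    return theta

# Toy problem: L(theta) = ||theta||^2 / 2 with noisy gradient observations.
rng = np.random.default_rng(0)
grad = lambda theta: theta + 0.1 * rng.standard_normal(theta.shape)
print(sgd(grad, np.ones(3)))  # converges towards the minimum at 0
</code>
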
===== RMSProp =====

In the original lecture slides where it was proposed, $\gamma$ is set to $.9$. In ((Dauphin, de Vries, Chung and Bengio, "RMSProp and equilibrated adaptive learning rates for non-convex optimization")), it is shown that the $\sqrt{g_{t + 1}}$ term approximates (in expectation) the diagonal of the absolute value of the Hessian matrix (assuming the update steps are $\mathcal{N}(0, 1)$ distributed). It is also argued that the absolute value of the Hessian is better to use for non-convex problems which may have many saddle points.
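
As a concrete reference for the discussion above, here is a minimal NumPy sketch of the standard RMSProp update, which accumulates $g_{t + 1} = \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2$ and divides each step by $\sqrt{g_{t + 1}} + \epsilon$; the hyperparameter values are commonly used defaults, assumed here for illustration:

<code python>
import numpy as np

def rmsprop(grad, theta0, eta=0.001, gamma=0.9, eps=1e-8, n_steps=100):
    """RMSProp: precondition each step by a decaying RMS of past gradients."""
    theta = theta0.copy()
    g = np.zeros_like(theta)  # moving average of the squared gradient
    for t in range(n_steps):
        dtheta = grad(theta)
        g = gamma * g + (1 - gamma) * dtheta ** 2  # second moment estimate
        theta = theta - eta * dtheta / (np.sqrt(g) + eps)
    return theta
</code>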
  
Alternatively, in ((Graves, "Generating Sequences with Recurrent Neural Networks")), a first-order moment approximator $m_t$ is added. It is included in the denominator of the preconditioner so that the learning rate is effectively normalized by the standard deviation of $\nabla \mathcal{L}$. There is also a $v_t$ term included for momentum. This gives
  
\begin{align*}
m_{t + 1} &= \gamma m_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t) \\
g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
v_{t + 1} &= \gamma v_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1} - m_{t + 1}^2 + \epsilon}} \\
\theta_{t + 1} &= \theta_t + v_{t + 1}
\end{align*}
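
A NumPy transcription of these updates might look as follows; note that $g_{t + 1} - m_{t + 1}^2$ is a variance estimate, so the denominator is the standard deviation mentioned above. The hyperparameter values are illustrative assumptions:

<code python>
import numpy as np

def rmsprop_graves(grad, theta0, eta=1e-4, gamma=0.95, eps=1e-4, n_steps=100):
    """Graves' RMSProp variant: normalize steps by an estimate of the gradient
    standard deviation, sqrt(E[grad^2] - E[grad]^2), and add momentum."""
    theta = theta0.copy()
    m = np.zeros_like(theta)  # first moment estimate
    g = np.zeros_like(theta)  # second moment estimate
    v = np.zeros_like(theta)  # momentum term / parameter update
    for t in range(n_steps):
        dtheta = grad(theta)
        m = gamma * m + (1 - gamma) * dtheta
        g = gamma * g + (1 - gamma) * dtheta ** 2
        v = gamma * v - eta * dtheta / np.sqrt(g - m ** 2 + eps)  # std in denominator
        theta = theta + v
    return theta
</code>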
  
===== Adadelta =====

Adadelta ((Zeiler, "Adadelta: An Adaptive Learning Rate Method")) uses the same exponentially decaying moving average estimate of the gradient second moment $g_t$ as RMSProp. It also computes a moving average $x_t$ of the updates $v_t$, similar to momentum, but when updating this quantity it squares the current step, which I don't have any intuition for.
  
\begin{align*}
g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
v_{t + 1} &= -\frac{\sqrt{x_t + \epsilon}}{\sqrt{g_{t + 1} + \epsilon}} \nabla \mathcal{L}(\theta_t) \\
x_{t + 1} &= \gamma x_t + (1 - \gamma) v_{t + 1}^2 \\
\theta_{t + 1} &= \theta_t + v_{t + 1}
\end{align*}
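
A NumPy transcription of these updates; note that no learning rate appears, since the step is rescaled by the ratio of the RMS of recent updates to the RMS of recent gradients. The values of $\gamma$ and $\epsilon$ are illustrative assumptions:

<code python>
import numpy as np

def adadelta(grad, theta0, gamma=0.95, eps=1e-6, n_steps=100):
    """Adadelta: no global learning rate; steps are scaled by
    RMS[recent updates] / RMS[recent gradients]."""
    theta = theta0.copy()
    g = np.zeros_like(theta)  # moving average of squared gradients
    x = np.zeros_like(theta)  # moving average of squared updates
    for t in range(n_steps):
        dtheta = grad(theta)
        g = gamma * g + (1 - gamma) * dtheta ** 2
        v = -np.sqrt(x + eps) / np.sqrt(g + eps) * dtheta
        x = gamma * x + (1 - gamma) * v ** 2  # note: squares the current step
        theta = theta + v
    return theta
</code>
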
===== Adam =====
  
Adam is somewhat similar to Adagrad/Adadelta/RMSProp in that it computes a decayed moving average of the gradient and squared gradient (first and second moment estimates) at each time step. It differs mainly in two ways: First, the first-order moment moving average coefficient is decayed over time (in the paper, by a multiplicative factor $\lambda$ per iteration; that decay is omitted from the updates below). Second, because the first and second moment estimates are initialized to zero, some bias correction is used to counteract the resulting bias towards zero. The use of the first and second moments ensures, in most cases, that the gradient descent step size is typically $\approx \pm \eta$ and bounded in magnitude by $\eta$. However, as $\theta_t$ approaches a true minimum, the uncertainty of the gradient will increase and the step size will decrease. The method is also invariant to the scale of the gradients. Given hyperparameters $\gamma_1$, $\gamma_2$, and $\eta$, and setting $m_0 = 0$ and $g_0 = 0$ (note that the paper denotes $\gamma_1$ as $\beta_1$, $\gamma_2$ as $\beta_2$, $\eta$ as $\alpha$ and $g_t$ as $v_t$), the update rule is as follows: ((Kingma and Ba, "Adam: A Method for Stochastic Optimization"))
  
\begin{align*}
m_{t + 1} &= \gamma_1 m_t + (1 - \gamma_1) \nabla \mathcal{L}(\theta_t) \\
g_{t + 1} &= \gamma_2 g_t + (1 - \gamma_2) \nabla \mathcal{L}(\theta_t)^2 \\
\hat{m}_{t + 1} &= \frac{m_{t + 1}}{1 - \gamma_1^{t + 1}} \\
\hat{g}_{t + 1} &= \frac{g_{t + 1}}{1 - \gamma_2^{t + 1}} \\
\theta_{t + 1} &= \theta_t - \frac{\eta \hat{m}_{t + 1}}{\sqrt{\hat{g}_{t + 1}} + \epsilon}
\end{align*}
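
A NumPy transcription of these updates; the defaults $\eta = 0.001$, $\gamma_1 = 0.9$, $\gamma_2 = 0.999$ and $\epsilon = 10^{-8}$ follow the paper's suggested settings (with the $\lambda$ decay of $\gamma_1$ omitted, as above):

<code python>
import numpy as np

def adam(grad, theta0, eta=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8,
         n_steps=100):
    """Adam: bias-corrected moving averages of the first and second moments."""
    theta = theta0.copy()
    m = np.zeros_like(theta)  # first moment estimate, m_0 = 0
    g = np.zeros_like(theta)  # second moment estimate, g_0 = 0
    for t in range(1, n_steps + 1):
        dtheta = grad(theta)
        m = gamma1 * m + (1 - gamma1) * dtheta
        g = gamma2 * g + (1 - gamma2) * dtheta ** 2
        m_hat = m / (1 - gamma1 ** t)  # correct initialization bias towards zero
        g_hat = g / (1 - gamma2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(g_hat) + eps)
    return theta
</code>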
  
===== ESGD =====

((Dauphin, de Vries, Chung and Bengio, "RMSProp and equilibrated adaptive learning rates for non-convex optimization"))
  
===== Adasecant =====
  
((Gulcehre and Bengio, "Adasecant: Robust Adaptive Secant Method for Stochastic Gradient"))
  
===== vSGD =====