
====== Stochastic Optimization Techniques ======

Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration. This stochastic variation is due to the model being trained on different data during each iteration. It is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum.

Furthermore, training neural networks is usually done using only the first-order gradient of the parameters with respect to the loss function. This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix. Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix. This document summarizes some of the more popular methods proposed recently; for a similar overview, see ((Ruder, An overview of gradient descent optimization algorithms http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms)) or the documentation of Climin ((https://climin.readthedocs.org/en/latest/#optimizer-overview)), or see ((Schaul, Antonoglou, Silver, Unit Tests for Stochastic Optimization)) for a comparison of these methods on some simple tasks.

In the following, we will use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}$ which is to be minimized.
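As a point of reference for the methods below, vanilla (stochastic) gradient descent simply moves each parameter a fixed step along the negative gradient of the loss computed on the current batch of data. A minimal sketch of that update, writing $\eta$ for the learning rate hyperparameter mentioned above ($\eta$ is introduced here only for illustration and is not used elsewhere in this document), is

$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} \mathcal{L}(\theta_t)$$

The techniques summarized below can largely be read as modifications of this update which rescale the gradient, adapt the effective step size over time, or otherwise precondition the step taken at each iteration.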