====== Stochastic Optimization Techniques ======
    
Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration.  This stochastic variation is due to the model being trained on different data during each iteration.  This is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently.  Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum.  Furthermore, training neural networks is usually done using only the first-order gradient of the loss function with respect to the parameters.  This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix.  Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings.  These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix.  This document summarizes some of the more popular methods proposed recently; for a similar overview, see ((Ruder, An overview of gradient descent optimization algorithms, http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms)) or the Climin documentation ((https://climin.readthedocs.org/en/latest/#optimizer-overview)), or see ((Schaul, Antonoglou, Silver, Unit Tests for Stochastic Optimization)) for a comparison on some simple tasks.
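As a point of reference for the methods below, here is a minimal sketch (not taken from this page) of vanilla minibatch stochastic gradient descent on a synthetic least-squares problem; the data setup, batch size, and learning rate are illustrative assumptions.  It shows how the objective being minimized changes at each iteration because a different minibatch is drawn, and how the update uses only the first-order gradient.

<code python>
# Minimal illustrative sketch: vanilla minibatch SGD on synthetic linear
# regression.  All specifics (data, batch size, learning rate) are assumptions
# chosen for illustration, not taken from this page.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + noise
n_samples, n_features = 10000, 20
X = rng.standard_normal((n_samples, n_features))
w_true = rng.standard_normal(n_features)
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

theta = np.zeros(n_features)   # parameters, theta_t in the text
learning_rate = 0.01           # the hyperparameter whose setting matters
batch_size = 64

for t in range(1000):
    # Each iteration draws a different minibatch, so the objective being
    # minimized is itself different (stochastic) at every step.
    idx = rng.integers(0, n_samples, size=batch_size)
    X_b, y_b = X[idx], y[idx]
    residual = X_b @ theta - y_b
    grad = X_b.T @ residual / batch_size   # first-order gradient only
    theta -= learning_rate * grad          # vanilla gradient descent step
</code>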
  
In the following, we will use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}$ which is to be minimized.
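For reference, in this notation plain stochastic gradient descent with a learning rate $\eta$ (a symbol introduced here only for this baseline) performs the update

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) $$

where the gradient is computed on the data used at iteration $t$; the methods summarized in this document can be read as modifications of this update.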