incremental_methods_for_additive_convex_cost_minimization

Presenter | Asuman Ozdaglar |

Context | NIPS 2015 Invited Talk |

Date | 12/8/15 |

Some optimization problems have an “additive cost”, that is, the objective is the sum of a finite number of convex real-valued functions. These arise in several applications, such as statistical estimation problems (LASSO, SVM, logistic regression), where the objective is the average of a loss function over the data, minimization of an expected value, or distributed optimization in networks. The number of component functions is often large, so it is beneficial to use incremental methods, which process the component functions sequentially. In such an approach, the order in which the component functions are processed must be determined: cyclically in a fixed arbitrary order, stochastically with replacement (e.g. SGD), by random reshuffling (sampling without replacement within each cycle), or in network-imposed orders. The fixed cyclic order gives the “incremental gradient” method, which has $O(1/\sqrt{k})$ convergence ($k$ is the number of iterations) for non-smooth functions and can be shown to have $O(1/k)$ convergence for smooth functions. Random reshuffling has a $\Theta(1/k^{2s})$, $s \in (1/2, 1)$, convergence rate, which is better than the $\Omega(1/k)$ rate of SGD.
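As a concrete sketch, one cyclic incremental-gradient pass over an additive least-squares objective might look as follows (the data, stepsize schedule, and function names here are illustrative assumptions, not from the talk):

```python
import numpy as np

# Hypothetical additive cost: f(x) = sum_i f_i(x) with
# f_i(x) = 0.5 * (a_i . x - b_i)^2, a least-squares loss split per sample.
rng = np.random.default_rng(0)
m, d = 50, 3
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def component_grad(x, i):
    """Gradient of the i-th component f_i(x) = 0.5*(a_i.x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def incremental_gradient(x0, n_cycles, stepsize):
    """One iteration = a cycle of m subiterations in a fixed order."""
    x = x0.copy()
    for k in range(1, n_cycles + 1):
        alpha = stepsize(k)          # decaying stepsize for convergence
        for i in range(m):           # fixed cyclic order over components
            x = x - alpha * component_grad(x, i)
    return x

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)       # true minimizer
x_hat = incremental_gradient(np.zeros(d), 2000, lambda k: 1.0 / (k * m))
print(np.linalg.norm(x_hat - x_star))  # shrinks toward 0 as cycles grow
```

With a constant stepsize instead of the decaying one, the iterates would only reach a neighborhood of `x_star`.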

At each iteration, the decision vector is updated incrementally by taking steps along the (sub)gradients of the component functions. Each iteration consists of a cycle of $m$ subiterations, one per component function. This is equivalent to the online backpropagation method for training neural networks. If the component functions are one-dimensional quadratics, we can think of the “far-out region” of the total function, where we are far from the global minimum, as the region in which the component gradients all push in the same direction. Near the global minimum, the gradients of the components can conflict, so the choice of stepsize is important. A decaying stepsize is essential for global convergence to an optimal solution, while a constant small stepsize only ensures convergence to a neighborhood of the optimal solution.
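The stepsize trade-off can be seen on a minimal hypothetical example with two one-dimensional quadratics whose gradients agree far from the minimum and conflict near it:

```python
# Toy illustration (not from the talk's slides verbatim): two 1-D
# quadratics f1(x) = (x - 1)^2 and f2(x) = (x + 1)^2; their sum is
# minimized at x* = 0, but f1 pulls toward +1 and f2 toward -1.
def g1(x): return 2.0 * (x - 1.0)   # f1'
def g2(x): return 2.0 * (x + 1.0)   # f2'

def cycle(x, alpha):
    """One incremental cycle: a step along f1's gradient, then f2's."""
    x = x - alpha * g1(x)
    return x - alpha * g2(x)

# Constant stepsize: fast progress in the far-out region, but the
# conflicting gradients near x* leave a persistent offset.
x_const = 10.0
for _ in range(200):
    x_const = cycle(x_const, 0.1)

# Decaying stepsize: the offset shrinks along with alpha, giving
# convergence to the optimum itself.
x_decay = 10.0
for k in range(1, 2001):
    x_decay = cycle(x_decay, 0.5 / k)

print(x_const, x_decay)  # offset near -1/9 vs. essentially 0
```

For the constant stepsize 0.1 the cycle map is $x \mapsto 0.64x - 0.04$, so the iterates settle at $-1/9$ rather than at the optimum.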

If we assume the component functions are smooth, can we guarantee better performance? The assumptions here are that the component functions are differentiable with Lipschitz gradients, the subgradients are bounded, and the objective is strongly convex. We can view the incremental method as an approximate gradient method, i.e. a gradient method with error: instead of taking a correct gradient step, we take a gradient step with some error. Smoothness allows us to replace the gradient by a form of multidimensional mean-value-theorem expression. Using gradient Lipschitzness and boundedness, the resulting error can be controlled. Finally, using strong convexity, we get a recursive bound for the error. Approximating this recursion shows that, with a decaying stepsize $R/k$ where $R$ is a constant, the error is bounded provided $R$ is appropriately sized; if $R$ is too small, the convergence rate degrades.
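A small numeric illustration of the $R/k$ sizing condition (the quadratics and constants are assumptions chosen for illustration, not the talk's example):

```python
# Toy additive cost: f1(x) = (x-1)^2, f2(x) = (x+1)^2, whose sum
# 2x^2 + 2 is strongly convex with mu = 4 and minimized at x* = 0.
def run(R, n_cycles):
    x = 10.0
    for k in range(1, n_cycles + 1):
        a = R / k                      # decaying stepsize R/k
        x = x - a * 2.0 * (x - 1.0)    # step along f1's gradient
        x = x - a * 2.0 * (x + 1.0)    # step along f2's gradient
    return abs(x)                      # distance to x* = 0

# Roughly, R must exceed 1/mu for the fast rate here; an undersized R
# gives a visibly slower decay even after many cycles.
print(run(0.5, 5000), run(0.05, 5000))
```

With $R = 0.5$ the error after 5000 cycles is on the order of $1/k$, while $R = 0.05$ leaves an error orders of magnitude larger.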

For quadratics, under very mild assumptions, one can show a more refined rate result for the distance to the optimum, with a constant that depends on the order in which the component functions are processed. This suggests processing the functions with larger Lipschitz constants first.
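A toy experiment along these lines, with scalar quadratics of unequal Lipschitz constants (all choices here are illustrative assumptions, and this sketch does not by itself establish which order is better):

```python
# Components f_i(x) = 0.5 * L_i * (x - c_i)^2 with Lipschitz constants
# L_i, processed in descending vs. ascending order of L_i.
L = [10.0, 4.0, 1.0]
c = [1.0, -2.0, 3.0]
x_star = sum(Li * ci for Li, ci in zip(L, c)) / sum(L)  # minimizer of sum

def run(order, n_cycles, R=0.15):
    x = 0.0
    for k in range(1, n_cycles + 1):
        a = R / k
        for i in order:
            x = x - a * L[i] * (x - c[i])   # gradient step on f_i
    return abs(x - x_star)

desc = run([0, 1, 2], 2000)   # largest Lipschitz constant first
asc = run([2, 1, 0], 2000)    # smallest Lipschitz constant first
print(desc, asc)
```

Both orders converge here; the refined constants in the rate result, not this simulation, are what distinguish them.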

The two most popular random orders are SGD (sampling with replacement) and random reshuffling (a fresh random permutation of the components in each cycle). There is a lot of empirical evidence that random reshuffling converges faster than SGD (Bottou 09), maybe even at a rate of $1/k^2$ rather than $1/k$. This has been anecdotally attributed to random reshuffling visiting the training data more uniformly. Analysis is hard because of dependencies among the gradient errors within and across shuffles. For strongly convex functions, there is a $1/k$ minimax lower bound for stochastic convex optimization, which SGD attains via Polyak-Ruppert averaging. Recently, it was shown that random reshuffling has a $1/k^{2s}$, $s \in (1/2, 1)$, rate with a constant that does not depend on the number of component functions. In simple simulations, the difference in convergence rates is striking. An intuitive explanation: SGD samples the data uniformly and independently at each iteration. In a toy example with two quadratics, the gradient error is then, uniformly at random, the pull toward one function's minimum or the other. Under random reshuffling, this gradient error has reduced variance, but the inner iterates are correlated. To further accelerate random reshuffling, the bias in the last cycle can be estimated and subtracted to obtain a $1/k^2$ convergence rate.
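The SGD-versus-reshuffling contrast can be reproduced in a few lines on a hypothetical two-quadratic example (parameters and setup are illustrative assumptions, not the talk's exact simulation):

```python
import numpy as np

# SGD samples a component with replacement at every step; random
# reshuffling (RR) draws a fresh permutation each cycle. Components:
# f1(x) = (x-1)^2, f2(x) = (x+1)^2, so the sum is minimized at x* = 0.
rng = np.random.default_rng(0)
grads = [lambda x: 2.0 * (x - 1.0), lambda x: 2.0 * (x + 1.0)]

def run(order_fn, n_cycles, R=0.5):
    x = 10.0
    for k in range(1, n_cycles + 1):
        a = R / k
        for i in order_fn():
            x = x - a * grads[i](x)
    return abs(x)   # distance to x* = 0

sgd_err = run(lambda: rng.integers(0, 2, size=2), 5000)  # with replacement
rr_err = run(lambda: rng.permutation(2), 5000)           # without replacement
print(sgd_err, rr_err)  # RR's gradient errors cancel within each cycle
```

In this setup RR visits both components every cycle, so the opposing pulls toward $+1$ and $-1$ largely cancel, while SGD's independent sampling leaves a variance-driven error.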
