|Context||NIPS 2015 Oral|
In classic line search, we seek a scalar step size along a given search direction that approximately minimizes a (nonlinear) objective. The step size is crucial for optimization performance. A line search starts from an initial function and gradient evaluation, then iteratively probes candidate step sizes until a point satisfying the Wolfe conditions is found. The sign of the directional derivative at a candidate, together with which Wolfe conditions hold there, determines whether to extrapolate further or to search within the current interval. Many classic line searches model the objective with a cubic polynomial interpolant, pick candidate points by repeatedly shrinking the bracketing interval, and accept once the Wolfe conditions are satisfied.

This works for deterministic functions, but many objectives in machine learning are noisy; for example, mini-batch optimization introduces noise that is approximately Gaussian. To extend line search to the stochastic setting, the objective is modeled with a Gaussian process surrogate whose posterior mean is a cubic spline, the search is performed with Bayesian optimization, and it terminates when probabilistic Wolfe conditions are satisfied. Specifically, the Gaussian process prior is an integrated Wiener process, which is robust and flexible: its posterior mean is piecewise cubic, so local minima can be found analytically as roots of a quadratic (the derivative of each cubic piece). For Bayesian optimization, expected improvement is used, evaluated only at a few well-chosen candidate points. For termination, note that the Wolfe conditions are positivity constraints; the probabilistic Wolfe conditions simply compute the probability that these quantities are positive under the surrogate, and the search accepts once that probability is high enough. The resulting line search makes convergence much less sensitive to the choice of the learning rate.
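The probabilistic termination test can be sketched concretely. Writing the sufficient-decrease and curvature conditions as positivity constraints, a_t = f(0) - f(t) + c1·t·f'(0) ≥ 0 and b_t = f'(t) - c2·f'(0) ≥ 0, both are linear in the jointly Gaussian GP posterior over function values and derivatives, so the probability that both hold is a bivariate Gaussian orthant probability. Below is a minimal sketch in Python, assuming the GP posterior mean and covariance over [f(0), f'(0), f(t), f'(t)] are already available; the function name `p_wolfe` and the exact interface are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def p_wolfe(mean, cov, t, c1=1e-4, c2=0.9):
    """Probability that the Wolfe conditions hold at step size t.

    mean: (4,) posterior mean of [f(0), f'(0), f(t), f'(t)]
    cov:  (4, 4) posterior covariance of the same quantities
    Returns P(a_t >= 0 and b_t >= 0) under the GP surrogate.
    """
    # The Wolfe conditions as positivity constraints (linear in the GP):
    #   a_t = f(0) - f(t) + c1 * t * f'(0) >= 0   (sufficient decrease)
    #   b_t = f'(t) - c2 * f'(0)           >= 0   (curvature)
    A = np.array([[1.0, c1 * t, -1.0, 0.0],
                  [0.0, -c2,     0.0, 1.0]])
    m = A @ mean                # mean of (a_t, b_t)
    S = A @ cov @ A.T           # covariance of (a_t, b_t)
    S = S + 1e-12 * np.eye(2)   # jitter for numerical stability
    # Orthant probability P(a_t >= 0, b_t >= 0): flipping signs turns it
    # into a bivariate normal CDF evaluated at the origin.
    return float(multivariate_normal(mean=-m, cov=S).cdf(np.zeros(2)))
```

The line search would accept a candidate once this probability exceeds a threshold (the paper uses roughly 0.3); as the observation noise shrinks to zero, the test reduces to the classic deterministic Wolfe check.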