elen_e4650_course_notes

In an optimization problem, we want to minimize some objective function $f_0(x)$ subject to some constraints, $f_1(x) \le 0, f_2(x) \le 0, \ldots, f_m(x) \le 0$. Here $x$ is the optimization variable, which belongs to $\mathbb{R}^n$. $f_0$ is the objective, which takes in a vector and returns a scalar, that is, $f_0 : \mathbb{R}^n \rightarrow \mathbb{R}$. The feasible set $T$ is the set of all points which satisfy the constraints, given by $T = \{x \mid f_i(x) \le 0,\; i = 1, \ldots, m\}$. We can then write our optimization problem in the “geometric form” $\min_x f_0(x) \;\mathrm{s.t.}\; x \in T$. We can perform maximization by minimizing the negative of the objective function, and we can handle an equality constraint like $h(x) = 0$ by using two inequality constraints $h(x) \le 0$, $-h(x) \le 0$. Optimization problems show up in a huge variety of fields.

You might want to design an integrated circuit, which includes specifying its size (length and width). There are some constraints on the size (area), manufacturing limits, and delay (caused by the time it takes for signals to travel in the device). The objective is some measure of device performance, for example power consumption.

If we receive some data from a system, we may want to fit a model to the data. As a common example, we have received data from a system which we have been told is a linear system. So, we'd like to fit a line (model) to the data, even when the data is noisy (not ideal). It could be much more complicated than fitting a line, but fundamentally we are always fitting a model. For a line, we are fitting the model $y = ax + b$ where the parameters are $a, b$. We may have constraints on, for example, the size of the parameters. The objective is to minimize the mismatch (error) between the model and the data.
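As a minimal sketch of this model-fitting idea, the following fits the line $y = ax + b$ to made-up noisy data by minimizing the squared error (the data and noise level here are invented for illustration):

```python
import numpy as np

# Hypothetical noisy data from a "linear" system: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(x.size)

# Fit y = a*x + b by least squares: stack [x, 1] columns so the
# parameter vector is (a, b).
X = np.column_stack([x, np.ones_like(x)])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b)  # close to the true parameters 2 and 1
```

Because the noise is small, the recovered parameters land near the true values.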

There are two ways to solve a given optimization problem: a formula or a numerical algorithm. In some rare cases, we can obtain a formula which gives us a unique closed-form solution to the problem. Otherwise, we need to use a numerical algorithm to find the best solution without coming up with an explicit formula. A numerical algorithm works by guessing a solution $x^{(0)}$ and evaluating the objective to gain information about the quality of the guess. Using this information, it comes up with better and better guesses $x^{(1)}, x^{(2)}, \ldots$ until it converges on a solution $x^\ast$. In practice, we will terminate the optimization after a certain number of steps.
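The iterative scheme above can be sketched with gradient descent, one common choice of numerical algorithm (the step size, stopping rule, and toy objective here are illustrative choices, not from the notes):

```python
# Gradient descent on f(x) = (x - 3)^2, whose minimizer is x* = 3.
def grad(x):
    # derivative of (x - 3)^2
    return 2 * (x - 3)

x = 0.0                # initial guess x^(0)
for _ in range(100):   # terminate after a fixed number of steps
    x = x - 0.1 * grad(x)
print(x)  # the iterates converge toward 3
```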

Given a numerical algorithm, we'd like to be able to talk about its complexity - how many operations are needed to achieve the solution we seek. The computational complexity of a problem refers to the number of basic operations (addition, subtraction, division, multiplication, called flops) needed to solve it. For example, to compute the inner product of two vectors of size $n$, you need to perform $2n - 1$ operations ($n$ multiplications and $n - 1$ additions). For multiplying two $n \times n$ matrices, you need on the order of $n^3$ operations.

In some cases, we have a linear system which transforms an input $x$ via a matrix $A$ and adds noise $n$ to produce the output $y$. If we know the output $y$ and the transformation matrix $A$, we want to determine $x$ even in the presence of noise. This can be formulated as an optimization problem $\min_x ||y - Ax||$. This problem has a closed-form solution (when $A^\top A$ is invertible) given by $x^\ast = (A^\top A)^{-1} A^\top y$.
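The closed-form solution can be checked numerically on synthetic data (the matrix, true $x$, and noise level below are made up for illustration):

```python
import numpy as np

# Synthetic linear system y = A x + noise.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.standard_normal(20)

# Closed-form least-squares solution x* = (A^T A)^{-1} A^T y.
x_star = np.linalg.inv(A.T @ A) @ A.T @ y
print(x_star)  # close to x_true
```

In practice one would solve the normal equations (or call `np.linalg.lstsq`) rather than forming the inverse explicitly, but the formula above matches the notes.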

In a linear programming problem, all of the functions $f_i$ in the optimization problem are linear, giving $\min_x a^\top x \;\mathrm{s.t.}\; b_i^\top x \le c_i, i = 1, \ldots, m$ where $a \in \mathbb{R}^n, b_i \in \mathbb{R}^n, c_i \in \mathbb{R}$. There is no closed-form solution for this class of problems - it must be solved numerically with an algorithm with complexity $\mathcal{O}(n^2 m)$.
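A small linear program in this form can be solved numerically with `scipy.optimize.linprog` (the particular objective and constraints below are an invented toy example):

```python
from scipy.optimize import linprog

# minimize  -x1 - 2*x2
# s.t.      x1 + x2 <= 4,  x1 <= 2,  x2 <= 3,  x1, x2 >= 0
res = linprog(c=[-1, -2],
              A_ub=[[1, 1], [1, 0], [0, 1]],
              b_ub=[4, 2, 3],
              bounds=[(0, None), (0, None)])
print(res.x)  # optimal point: x1 = 1, x2 = 3
```

The solver pushes $x_2$ to its bound of 3, after which $x_1 + x_2 \le 4$ forces $x_1 = 1$.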

In the very simple problem $\min a^\top x \;\mathrm{s.t.}\; x_i^2 = 1$ there are $2^n$ possible choices for $x$, and we must evaluate all of them to find the optimal point. So, this problem has complexity $\mathcal{O}(2^n)$.
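The $2^n$ brute-force search can be written out directly for a small $n$ (the vector $a$ below is an arbitrary example); the minimum is attained at $x_i = -\mathrm{sign}(a_i)$, giving $-\sum_i |a_i|$:

```python
import itertools
import numpy as np

# Enumerate all 2^n sign vectors x with x_i^2 = 1 and minimize a^T x.
a = np.array([3.0, -1.0, 2.0])
best = min(a @ np.array(x)
           for x in itertools.product([-1, 1], repeat=a.size))
print(best)  # -sum(|a_i|) = -6.0
```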

The objective function for a linear programming problem is linear. We can focus on a more general class of problems where the objective function and constraints are convex. A function $f(x) : \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if $\forall x, y \in \mathbb{R}^n, \forall \alpha \in [0, 1]$ we have $\alpha f(x) + (1 - \alpha)f(y) \ge f(\alpha x + (1 - \alpha)y)$. The intuitive geometry is if you draw a line from one point on the function to another, the function will fall below this line.
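The defining inequality can be spot-checked numerically for a known convex function such as $f(x) = x^2$ (random sampling is only a sanity check, not a proof of convexity):

```python
import numpy as np

# Check alpha*f(x) + (1-alpha)*f(y) >= f(alpha*x + (1-alpha)*y)
# at randomly sampled points for f(x) = x^2.
f = lambda x: x ** 2
rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.standard_normal(2)
    alpha = rng.uniform()
    assert alpha * f(x) + (1 - alpha) * f(y) \
        >= f(alpha * x + (1 - alpha) * y) - 1e-12
print("convexity inequality held at all sampled points")
```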

In linear programming, the feasible set $T$ is a polyhedron (a polygon, in two dimensions). Again, we can focus on the more general class of convex feasible sets. A set $S$ is convex if $\forall x, y \in S, \forall \alpha \in [0, 1]$, we have $\alpha x + (1 - \alpha)y \in S$. Intuitively, all points on a segment between any two points in the set also lie in the set.

If we have some arbitrary optimization problem, there are three types of solutions - feasible, global, and local. A feasible solution is one which satisfies all the constraints - a point $x \;\mathrm{s.t.}\; f_i(x) \le 0, i = 1, \ldots, m$ or equivalently $x \in T$. A global solution is one which minimizes the objective function subject to all constraints - this is the typical notion of the “best” solution. A local solution is one which is best in its vicinity: we can't perturb it a little bit and make the objective better, but it need not be the global solution. We may settle for a local solution when searching all of $T$ is difficult. More formally, a feasible solution is an $x^\ast$ which satisfies $x^\ast \in T$, a global solution is $x^\ast = \mathrm{arg}\min f_0(x) \;\mathrm{s.t.}\; x \in T$, and a local solution is one where $x^\ast = \mathrm{arg}\min f_0(x) \;\mathrm{s.t.}\; x \in T, ||x - x^\ast|| \le R$ for some $R > 0$. For a convex problem, any local solution is also global. Every convex optimization problem can be solved by a numerical algorithm whose complexity is polynomial in $n$ and $m$.

Often, we can break up an optimization problem into a series of subproblems whose solutions depend on the solutions of one another. This allows us to use a numerical algorithm to solve each subproblem while communicating with the algorithms solving the other subproblems, which lends itself well to parallel computing.

A point $z$ is called an affine combination of the points $x$ and $y$ if there exist $\alpha, \beta$ with $\alpha + \beta = 1$ such that $z = \alpha x + \beta y$. In two-dimensional space, the set of such points is a line through $x$ and $y$; $\alpha$ or $\beta$ can become negative as long as they still sum to $1$. A point $z$ is called a convex combination of $x$ and $y$ if there exist $\alpha, \beta$ such that $z = \alpha x + \beta y$ with $\alpha + \beta = 1$ and $\alpha, \beta \ge 0$. In two-dimensional space, this is just the line segment between $x$ and $y$. A point $z$ is called a conic combination of $x$ and $y$ if there exist $\alpha, \beta$ such that $z = \alpha x + \beta y$ with $\alpha, \beta \ge 0$. The set of conic combinations looks like a cone starting at the origin with boundaries set by the points $x$ and $y$. Any of these definitions can be extended to the case where we are given $x_1, x_2, \ldots, x_k$. More specifically, a combination $\sum\limits_{i = 1}^k \alpha_i x_i$ is affine if $\sum \alpha_i = 1$, convex if $\sum \alpha_i = 1$ and $\alpha_i \ge 0 \; \forall i$, and conic if $\alpha_i \ge 0 \;\forall i$.

A set $S$ is called affine if every affine combination of points in the set is in the set. Similarly, $S$ is convex if every convex combination of points in the set is in the set. Finally, $S$ is a cone if every conic combination of points in the set is in $S$. A hyperplane in $\mathbb{R}^d$ is a set of points which satisfies the equation $\sum\limits_{i = 1}^d a_i x_i = b$ or equivalently $a^\top x = b$. A halfspace is the set of all points on one side of a hyperplane, given by $a^\top x \ge b$ or $a^\top x \le b$. A polyhedron is the set of all vectors $x$ such that $a_i^\top x \le b_i$ for $i = 1, \ldots, m$ and $c_j^\top x = d_j$ for $j = 1, \ldots, k$. In other words, a polyhedron can be expressed by a number of hyperplanes and halfspaces defined by a number of linear equations and inequalities. Equivalently, we are taking the intersection of a set of halfspaces and hyperplanes.

Solving a linear programming problem involves the minimization $\min c^\top x \;\mathrm{s.t.}\;\; a_i^\top x \le b_i, i = 1, \ldots, m; \; c_j^\top x = d_j, j = 1, \ldots, k$. The feasible set of points for this problem is a polyhedron, and the optimal point in a linear program will always be at one of the vertices of the polyhedron. However, the number of vertices can be exponential in the dimensionality of the problem. For example, the constraint that each member $x_i$ of the optimization variable $x$ lies between $-1$ and $1$ can be written as $2n$ inequality constraints. The feasible set is a hypercube, which in $n$-dimensional space has $2^n$ vertices, corresponding to each possible length-$n$ sequence of $-1$s and $1$s. So even when we only have 500 or so constraints, an algorithm which checks every vertex would need to consider over $10^{150}$ of them. There are much better algorithms.

Suppose we are given a nonconvex set, which we know is hard to optimize over. We can define the convex hull of the set as a convex surrogate, which is intuitively the smallest convex set which contains the nonconvex set. Formally, the convex hull of a set $S$ is the set of all convex combinations of points in $S$. For example, a discrete set of $L$ points has a convex hull which is a polyhedron. In general, we want to perform some kind of transformation on the set to convert it to a convex set.
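The discrete-points example can be computed with `scipy.spatial.ConvexHull` (the five points below are an invented example; the interior point drops out of the hull):

```python
import numpy as np
from scipy.spatial import ConvexHull

# Five points in the plane: four corners of the unit square plus one
# interior point. The convex hull is the square; the interior point
# (index 4) is not a vertex of the hull.
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5]])
hull = ConvexHull(pts)
print(sorted(hull.vertices))  # [0, 1, 2, 3]
```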

In the space $\mathbb{R}^{n \times n}$, we can consider the set $M$ of all matrices of size $n \times n$. This set is convex because for any two matrices $X, Y \in M$, the matrix $\alpha X + (1 - \alpha)Y$ is also in $M$ for $\alpha \in [0, 1]$. If we instead consider $S^n$, the set of all symmetric matrices in $\mathbb{R}^{n \times n}$, the quantity $\alpha X + (1 - \alpha)Y$ is still symmetric for $X, Y \in S^n$, so $S^n$ is convex. The conic combination $\alpha X + \beta Y$, $\alpha, \beta \ge 0$, is also still symmetric, so $S^n$ is also a cone.

The union of two convex sets is not necessarily convex, but the intersection of two convex sets is always convex. A linear mapping applied to a convex set is always convex, where a linear mapping of a point $x$ is defined as $f(x) = Ax + b$. We can apply a mapping to each point in a set $S$, written as $f(S)$, which we call the image of $S$ under $f$. When $f$ is a linear mapping, the inverse image of a convex set $f^{-1}(S)$ also preserves convexity.

We define the distance between two sets as the smallest distance over all pairs of points, one from each set: $\mathrm{dist}(S_1, S_2) = \inf\{||x - y|| : x \in S_1, y \in S_2\}$.

We can always look at a convex set as the intersection of (potentially infinitely many) linear inequalities and equalities. More specifically, every closed convex set is the intersection of some (potentially infinite) collection of halfspaces.

Given two convex sets, we might like to find a hyperplane which separates the two sets. That is, we would like to know if there exists an $a \ne 0, b$ for which $a^\top x \le b \; \forall x \in S_1$ and $a^\top x \ge b \; \forall x \in S_2$ for two convex sets $S_1$ and $S_2$. This is achievable for convex sets but not necessarily for nonconvex sets. We call the separation “strict” when the hyperplane doesn't intersect with $S_1$ or $S_2$. However, given two convex sets $S_1$, $S_2$ which do not intersect, it may be possible that there is no strict separation.

The loose definition of convexity for functions is that if you draw a line between two points on the function, the function will lie beneath the line. More specifically, we call a function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ convex if $\forall x, y \in \mathbb{R}^n, \alpha f(x) + (1 - \alpha)f(y) \ge f( \alpha x + (1 - \alpha)y )$ where $\alpha \in [0, 1]$. We can derive a convex set from a convex function $f$ by looking at the set of points which lie above the function, the epigraph. We call a function $f$ concave when $-f$ is convex.

$f(x) = x^2$ is convex, $f(x) = -x^2$ is concave. Affine functions $f(x) = a^\top x + b$ are both convex and concave (and they are the only functions which are both convex and concave). Piecewise functions, such as $f(x) = |x|$, can be convex. Intuitively, convexity has to do with the curvature of the function.

If we have a convex function and we draw a line which is tangent to the curve at any point, it will lie beneath the function. More specifically, if $f$ is convex then $f(y) \ge f(x) + \nabla f(x)^\top(y - x)\; \forall x, y$. The quantity on the right of the inequality is a description of the tangent line. The gradient operator $\nabla$ is a high-dimensional generalization of the derivative, computed as the partial derivative of a function with respect to each dimension of the input variable. We call this condition the “first order condition”.
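The first-order condition can be spot-checked numerically for a known convex function such as $f(x) = e^x$, where $\nabla f(x) = e^x$ (again, random sampling is a sanity check, not a proof):

```python
import numpy as np

# Check f(y) >= f(x) + f'(x)*(y - x) for f(x) = exp(x)
# at randomly sampled pairs of points.
rng = np.random.default_rng(5)
for _ in range(1000):
    x, y = rng.uniform(-3, 3, 2)
    assert np.exp(y) >= np.exp(x) + np.exp(x) * (y - x) - 1e-9
print("first-order condition held at all sampled points")
```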

The curvature of the function can be studied using the second derivative, or Hessian. The Hessian $\nabla^2 f$ is a matrix whose entries are the second-order partial derivatives with respect to each pair of variables. A function is convex when its second derivative is nonnegative, or equivalently in higher dimensions, when its Hessian is positive semidefinite. A matrix is positive semidefinite when all of its eigenvalues are nonnegative. Equivalently, a matrix $A \in \mathbb{R}^{n \times n}$ is positive semidefinite if $x^\top A x \ge 0 \; \forall x \in \mathbb{R}^n$. We write $A \succeq 0$ to denote positive semidefiniteness, and write $A \succeq B$ when $A - B \succeq 0$.
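Positive semidefiniteness can be checked by computing eigenvalues; here one matrix is built as $B^\top B$ (always PSD) and compared against an indefinite one (both matrices are invented examples):

```python
import numpy as np

# A = B^T B is always positive semidefinite; C has a negative eigenvalue.
rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B.T @ B
A_is_psd = bool(np.all(np.linalg.eigvalsh(A) >= -1e-10))

C = np.array([[1.0, 0.0], [0.0, -1.0]])
C_is_psd = bool(np.all(np.linalg.eigvalsh(C) >= -1e-10))
print(A_is_psd, C_is_psd)  # True False
```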

The function $f(x) = e^{ax}$ is convex for any $a$ because $f^{\prime \prime}(x) = a^2e^{ax}$, which is nonnegative because $a^2$ is nonnegative and $e^{ax}$ is positive. The function $f(x) = x^a$ where $x$ is nonnegative is convex when $a \ge 1$ or $a \le 0$ because $f^{\prime \prime}(x) = a(a - 1)x^{a - 2} \ge 0$ there; when $0 < a < 1$, the function is concave. $f(x) = \log(x)$ is concave because $f^{\prime \prime}(x) = \frac{-1}{x^2}$, which is always negative, so $-\log(x)$ is convex. The function $f(x) = x\log(x)$ is convex because $f^{\prime \prime}(x) = 1/x$, which is positive for $x > 0$.

A quadratic function in higher dimensional space can be written as $x^\top Ax + b^\top x + c$, where $x \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, $c \in \mathbb{R}$. When $A$ is symmetric positive semidefinite, the quadratic function is convex because $\nabla^2 f(x) = 2A$. The function $f(x_1, x_2) = x_1^2 + x_2^2 + x_1 - x_2$ is convex because $\nabla^2 f(x_1, x_2) = 2I$. The similar function $f(x_1, x_2) = x_1^2 - x_2^2 + x_1 - x_2$ is not convex; the Hessian's eigenvalues are $2$ and $-2$.
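The two example quadratics have constant (diagonal) Hessians, so the eigenvalue check is immediate:

```python
import numpy as np

# Hessian of f = x1^2 + x2^2 + x1 - x2 is 2I (convex);
# Hessian of f = x1^2 - x2^2 + x1 - x2 is diag(2, -2) (not convex).
H1 = np.array([[2.0, 0.0], [0.0, 2.0]])
H2 = np.array([[2.0, 0.0], [0.0, -2.0]])
print(np.linalg.eigvalsh(H1))  # [2. 2.]
print(np.linalg.eigvalsh(H2))  # [-2. 2.]
```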

Given two convex functions $f_1, f_2$, their sum $g(x) = f_1(x) + f_2(x)$ is convex because, for example, $\nabla^2 g(x) = \nabla^2 f_1(x) + \nabla^2 f_2(x) \succeq 0$. Similarly, a weighted sum $g(x) = w_1 f_1(x) + w_2 f_2(x)$ is convex as long as the weights $w_1, w_2 \ge 0$. A linear transformation applied to the argument $f(Ax + b)$ of a convex function $f$ preserves convexity.

We can also say a function $f$ is convex when $f\left(\sum\limits_{i = 1}^k \alpha_i x_i\right) \le \sum\limits_{i = 1}^k \alpha_i f(x_i)$ for $\alpha_i \ge 0$ and $\sum\limits_i \alpha_i = 1$. We can also view this probabilistically as Jensen's inequality, $f(\mathbb{E}(x)) \le \mathbb{E}(f(x))$.
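The probabilistic form can be illustrated with $f(x) = x^2$ and a standard normal $x$, where $f(\mathbb{E}(x)) = 0$ while $\mathbb{E}(f(x)) = 1$ (the sampling setup is an invented example):

```python
import numpy as np

# Jensen's inequality f(E[x]) <= E[f(x)] for the convex f(x) = x^2.
rng = np.random.default_rng(4)
x = rng.standard_normal(100_000)
lhs = np.mean(x) ** 2      # f(E[x]), near 0
rhs = np.mean(x ** 2)      # E[f(x)], near 1
print(lhs <= rhs)  # True
```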

elen_e4650_course_notes.txt · Last modified: 2015/12/17 21:59 (external edit)