elen_4835_course_notes

More and more it's less important to be able to process a single video or song, and instead it's important to be able to process a very large amount of very high-dimensional data. However, this data typically has some kind of low-dimensional structure, which would be nice to describe/exploit. We'd like to build a representation which in a very loose sense can “understand” something about the signals.

Whether the task is classification, compression, etc., the goal is to find a model which helps with the task as opposed to being “handed” a model. In other words, starting with some motivating problems, we attempt to simplify them to determine what's relevant in the signal, and then abstract a mathematical model. Once the model has been written down, we can study its theory and develop algorithms to compute it, and then use it in practice to determine its effectiveness. This can then re-motivate new models.

Often a signal is given to us as a collection of numbers in a computer. A grayscale image of size $W \times H$ is typically described by a quantized number per pixel - RGB has three quantized numbers per pixels. Although it's discretized, we can think of a greyscale image as a matrix in $\Re^{W \times H}$. A sound digitized to $L$ samples is a vector in $\Re^L$.

If we have two signals in $v_1, v_2 \in \Re^L$, we can add them and perform scalar multiplication. A vector space $V$ (over $\Re$) is a set with these two operations - addition and scalar multiplication - which $\forall x, y, z \in \Re^L, \alpha, \beta \in \Re$ satisfies

- $x + y = y + x$
- $x + (y + z) = (x + y) + x$
- $\exists \vec{0} \in V : \vec{0} + x = x$
- $\alpha(x + y) = \alpha x + \alpha y$
- $(\alpha + \beta)x = \alpha x + \beta x$
- $\alpha(\beta x) = (\alpha\beta)x$
- $1x = x, 0x = \vec{0}$.

- $\Re^L, \Re^{W \times H \times 3}$
- $\{x_i \in \Re\}_{i=1}^{\infty}$
- $C[0, 1] = \{f : [0, 1] \rightarrow \Re : f\mathrm{\;continuous}\}$

A linear subspace $W$ of a vector space $V$ is a subset of $V$ that is itself a vector space. A sufficient condition is that $\forall w_1, w_2 \in W, \alpha \in \Re; w_1 + w_2 \in W, \alpha w_1 \in W$.

$v_1, \ldots v_N$ are linearly dependent if $\exists \alpha_1, \ldots \alpha_N$ (not all zero) such that $\alpha_1v_1 + \ldots + \alpha_Nv_N = \vec{0}$. If $v_1, \ldots v_N$ are not linearly dependent, they are linearly independent.

A basis for $V$ is a set of vectors $v_1, \ldots v_N$ which are linearly independent such that $\forall v \in V, \exists \alpha_1 \ldots \alpha_N : \sum_i \alpha_i v_i = v$.

**Theorem**
If $v_1, \ldots v_n$ form a basis for $V$ and $w_1,
\ldots w_m$ also form a basis for $V$ then $m = n$ - any basis has the same
size.

**Proof**
Suppose that $m > n$. Since $w_1 \in V$, $\exists \alpha_i \mathrm{\;s.t.\;} w_1 = \alpha_1 + v_1 + \ldots \alpha_n v_n$. Since $w_i$ are a basis, $w_1 \ne \vec{0}$ because $\vec{0}$ is not linearly independent of any vector. $v_i$ are linearly independent so some $\alpha_i$ must not be zero - assume WLOG it is $\alpha_1$. Then we can write $v_1 = \alpha_1^{-1} w_1 + (-\alpha_1^{-1}\alpha_2)v_2 + \ldots + (-\alpha_1^{-1}\alpha_n)v_n$. This implies that $w_1, v_2, v_3, \ldots v_n$ linearly span $V$ and are linearly independent (a basis), so we can write $w_2 = \zeta_1 w_1 + \zeta_2 v_2 + \ldots + \zeta_n v_n$. As before, we can write $v_2 = \zeta_2^{-1}\zeta_1)w_1 + \zeta_2^{-1}w_2 + (-\zeta_2^{-1}\zeta_3)v_3 + \ldots + (-\zeta_2^{-1}\zeta_n)v_n$ and show that $w_1, w_2, v_3, \ldots v_n$ are a basis for $V$. Continuing in this manner, we can demonstrate that $w_1, \ldots w_m$ is a basis for $V$, such that we can write $w_{n+1} = \sum_{i = 1}^n \xi_i w_i$, which raises a contradiction because it implies $w_1, \ldots w_m$ are linearly dependent. If we suppose $m < n$ we can repeat this argument substituting $v$ for $w$. Therefore, $m = n$.

$L : V \rightarrow W$ is a linear map if $\forall x, y \in V, \alpha, \beta \in \Re, L(\alpha x + \beta y) = \alpha L(x) + \beta L(y)$. Sampling, Fourier/wavelet transforms are all examples of linear maps.

Every linear map has a range defined by $\mathrm{range}(L) = \{w \in W \mathrm{\;s.t.\;} w = \phi(v), v \in V\}$ and a null space defined by $\mathrm{null}(L) = \{v \in V \mathrm{\;s.t.\;} L(v) = 0\}$. The range is a subspace of $W$; the nullspace is a subspace of $V$. The rank of a linear map is the dimension of its range.

A linear map $L: V \rightarrow W$ can be represented as a matrix multiplication. If we take a basis $v_1, \ldots v_N$ for $V$, then we can write any $v \in V$ as a linear combination of elements from the basis $v = \sum_{i=1}^{n} v_i \alpha_i$. Given a basis $w_1, \ldots w_m$ for a target space $W$, we have $L(v) = \beta_1 w_1 + \ldots \beta_m w_m$. If we set $A_{i, j}$ to the coefficients of the expression for $L(v_j)$ in terms of the $w_i$, we have $L(v_j) = A_{1, j}w_1 + A_{2, j}w_2 + \ldots + A_{m, j}w_m$. Then we have $\beta = A\alpha$. If $U, V, W$ are vector spaces with linear maps $\phi : U \rightarrow V$ and $\psi: V \rightarrow W$ with corresponding matrices $\Phi$ and $\Psi$ then the matrix representation of $\psi \odot \phi(u) = \psi( \phi( u ) )$ is just the matrix product $\Phi\Psi$.

We can think of a set of observations $x_1, \ldots x_n \in \Re^m$, as a data matrix $X = [x_1 | \ldots | x_n ] \in \Re^{m \times n}$.

The $\mathrm{range}(A) = \{Ax | x \in \Re^n \}$ is the “column space” of $A$, a subset of $\Re^m$. The rank of $A$ is equal to the size of $\mathrm{range}(A)$, which is equal to the number of linearly independent columns, which is equal to the number of linearly independent rows. $\mathrm{null}(A) = \{x | Ax = \vec{0} \}$ is the nullspace, which is a subset of $\Re^n$. That $x \perp y$ means that $\langle x, y \rangle = \sum_i x_i y_i = 0$. For a set $S$, $S^\perp = \{y | y\perp z \forall z \in S\}$. The null space of a matrix is orthogonal to the range of the the matrix transposed, or $\mathrm{null}(A) = [\mathrm{range}(A^\top)]^\perp$. There is a solution to $y = Ax$ if $y \in \mathrm{range}(A)$; the solution is unique if $\mathrm{null}(A)$ is empty. A matrix is orthogonal if $M^\top M = I$, that is, its columns or orthogonal.

For some vector $x \in \Re^n$, $||x||_2 = \sqrt{x_1^2 + \ldots + x_n^2} = \sqrt{x^\top x}$, the L-2 norm, the “length” of the vector. Normalized vectors have $||x||_2 = 1$. For some matrix $X \in \Re^{m \times n}$, the Frobinius norm is $||X||_F = \sqrt{\sum_{ij} X_{ij}^2}$. This is just summing up the square of each element, and taking the square root. We can define the matrix inner product $\langle X, Y \rangle = \sum_{ij} X_{ij}Y_{ij} = \mathrm{tr}[X^\top Y]$, so $||X||_F = \sqrt{\langle X, X \rangle}$. The trace is defined as $\mathrm{tr}[M] = \sum_iM_{ii}$.

When $A$ is square, we can say $A$ is non-singular if $\forall y, \exists x : Ax = y$ or in other words $A^{-1}$ exists. $A^{-1}$ is a linear map. The determinant is defined by $\mathrm{det} = \sum_{\sigma(M)} \mathrm{sgn}(\sigma)\prod_i A_{i, \sigma(i)}$ where $\sigma(M)$ is the permutation of all integers from $1$ to $M$. This is essentially measuring volume. We can compute the inverse by $A^{-1} = \frac{1}{\mathrm{det}(a)} \mathrm{adj}(A)$. A is nonsingular if $\mathrm{det}(A) \ne 0$. If $A, B$ are square, then $\mathrm{det}(AB) = \mathrm{det}(A)\mathrm{det}(B)$.

Given a set of vectors $v_1, \ldots, v_n$, assumed to be linearly independent (though this is not necessary), we can find an orthonormal basis $\phi_1, \ldots, \phi_n$ for them by recursively setting $\phi_1 = \frac{v_1}{||v_1||_2}$ and $\phi_n = \frac{v_n - \sum_{i=1}^{n-1} \langle v_n, \phi_i \rangle \phi_i}{||v_n - \sum_{i=1}^{n-1} \langle v_n, \phi_i \rangle \phi_i||_2}$. If we let $V = [v_1 | \cdots | v_n]$ and $Q = [\phi_1 | \cdots | \phi_n]$ then we can find an upper triangular $R$ such that $V = QR$.

Symmetric matrices $A$ have quadratic forms $f(x) = x^\top Ax$. Given an arbitrary square $A$, you can always decompose $A$ as the sum of a symmetric part $\frac{A + A^\top }{2}$ and an antisymmetric part $\frac{A - A^\top}{2}$.

- The Hessian: Given $f : \Re^m \rightarrow \Re$ where $f$ is twice differentiable, $[\nabla^2 f(x) ]_{ij} = \frac{\partial^2f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.
- Given a random vector $X = (x_1, \ldots x_n)$ then the covariance matrix is $\Sigma_{i, j} = (x_i - \mathbb{E}[x_i])(x_j - \mathbb{E}[x_j])$.

Given $A \in \Re^{m \times m}$ then a vector $x \in \mathbb{C}^m$ is an eigenvector of $A$ if $x \ne \vec{0}$ and $\exists \lambda \in \mathbb{C}$ such that $Ax = \lambda x$. $\lambda$ is the eigenvalue associated with $x$. Equivalently, $(A - \lambda I)x = 0$ so $x$ is in the nullspace of $A - \lambda I$, and $\mathrm{det}(A - \lambda I) = 0$. $\chi (\lambda) = \mathrm{det}(A - \lambda I)$ is the characteristic polynomial associated with $A$. This is a degree $m$ polynomial whose roots are the eigenvalues. We can say $A$ has $m$ complex eigenvalues if we count with multiplicity.

**Theorem**
Eigenvalues of symmetric matrices are real.

**Proof**
If $\lambda$ is an eigenvalue of $A$, $\exists x \ne \vec{0} : Ax = \lambda x$. Taking the complex conjugate, $\bar{A}\bar{x} = \bar{\lambda} \bar{x} \rightarrow A\bar{x} = \bar{\lambda} \bar{x} \rightarrow \bar{x}^\top Ax = \lambda \bar{x}^\top x \rightarrow x^\top A \bar{x} = \bar{\lambda}x^\top x$$ \rightarrow \bar{x}^\top Ax - x^\top A\bar{x} = (\lambda - \bar{\lambda})x^\top \bar{x}$. The right side of this equation is zero, and $x^\top \bar {x} \ne 0$, so $ \lambda = \bar{\lambda}$.

**Theorem**
Eigenvectors corresponding to different eigenvalues are orthogonal.

**Proof**
If $x_1, x_2$ are eigenvectors of a symmetric matrix $A$, and $\lambda_1, \lambda_2$ are the corresponding eigenvalues, and $\lambda_1 \ne \lambda_2$, then $x_1^\top x_2 = 0 \rightarrow x_1 \perp x_2$ because if $Ax_1 = \lambda_1 x_1, Ax_2 = \lambda_2 x_2$ then $x_2^\top Ax_1 = \lambda_1 x_2^\top x_1, x_1^\top Ax_2 = \lambda_2 x_1^\top x_2 \rightarrow 0 = (\lambda_1 - \lambda_2)x_1^\top x_2$.

If the eigenvalues $\lambda_1, \ldots \lambda_m$ are distinct, $x_1, \ldots x_m$ form a basis for the space, or in other words if $\lambda_1$ is an eigenvalue of $A$ with multiplicity $k$ then there exist $k$ mutually orthogonal eigenvectors $x_1, \ldots, x_k$ which correspond to $\lambda_1$. Using induction on $k$, we take $x_k$ to be one of the eigenvectors corresponding to $\lambda_1$ assumed to have unit norm $x_k^\top x_k = 1$ and a matrix $H \in \Re^{m \times (m-1)} : H^\top H = I$, $x_k^\top H = 0, B = [x_k | H], B^\top B = I$, $\mathrm{det}(A - \lambda I) = \mathrm{det}(B^\top )\mathrm{det}(A - \lambda I)\mathrm{det}(B) = \mathrm{det}(B^\top AB - \lambda B^\top B)$$ = \mathrm{det}(B^\top AB - \lambda I)$. Noting that $B^\top A B = [ [\lambda_1, 0], [0, H^\top AH] ]$, we have $\mathrm{det}(B^\top AB - \lambda I) = \mathrm{det}([\,[\lambda_1 - \lambda, 0], [0, H^\top AH - \lambda I]\,] = (\lambda_1 - \lambda)\mathrm{det}(H^\top A H - \lambda I)$ so $\lambda_1$ is a root with multiplicity $k - 1$.

Given a symmetric matrix $A \in \Re^{m \times m}$ there exist real eigenvalues $\lambda_1, \ldots \lambda_m$ corresponding to $v_1, \ldots v_m$ forming matrix $V = [v_1 | \ldots | v_m ]$ with $V^\top V = I$ (orthonormal columns, $V^{-1} = V^\top $), so $A$ can be decomposed as $V\Lambda V^\top$, the eigenvector decomposition, where $\Lambda$ is the diagonal matrix of eigenvalues.

** Theorem **
Given a matrix $M \in \Re^{m \times n}, m \ge n$ there exist scalars $\sigma_1, \sigma_2, \ldots, \sigma_N \ge 0$ and orthogonal matrices $U \in \Re^{m \times m}, V \in \Re^{n \times n}$ such that $M = U \Sigma V^\top$ where $\Sigma$ is the diagonal matrix (padded with zeros such that $\Sigma \in \Re^{m \times n}$) of the singular values $\sigma_m$. Equivalently, $M = \sum_{i=1}^n \sigma_i u_i v_i^\top$. If $\mathrm{rank}(M) = r$, we can construct smaller matrices $\tilde{U} = [u_1 | \ldots | u_r]; \tilde{V} = [v_1, \ldots, v_r]$ and $\tilde{\Sigma}$ is the diagonal matrix of SVs up to $\sigma_r$, then $M = \tilde{U}\tilde{\Sigma}\tilde{V}^\top$.

** Proof **
Looking at $M^\top M \in \Re^{n \times n}$, a square, symmetric matrix with eigenvalue decomposition $M^\top M = V\Lambda V^\top$ with $\mathrm{rank}(M^\top M) = \mathrm{rank}(M) = \mathrm{rank}(\Lambda) = r$, $r$ of the eigenvalues $\lambda_i$ are zero with $\lambda_1 > \lambda_2 > \ldots \lambda_r$ all positive. Setting $\tilde{\Lambda}$ as the diagonal matrix of the first $r$ eigenvalues and $\tilde{V} = [v_1 | \ldots | v_r] \in \Re^{n \times r}$, then $\tilde{U} = M\tilde{V}\tilde{\Lambda}^{-1/2}$ where $\tilde{\Lambda}^{-1/2}$ is the diagonal matrix of $1/\sqrt{\lambda_i}$. Then $\tilde{U} \in \Re^{m \times r}, \tilde{U}^\top\tilde{U} = I$, $\tilde{V}^\top\tilde{V} = I$, and $\tilde{\Sigma} = \tilde{\Lambda}^{1/2} \in \Re^{r \times r}$ so $\tilde{U}\tilde{\Sigma}\tilde{V}^\top = M\tilde{V}\tilde{V}^\top$. Hopefully this last term is equal to $M$. We have $M^\top M = V\Lambda V^\top = \sum_i \lambda_i v_i v_i^\top = (\sum_i \lambda_i v_i v_i^\top)(\sum_{j=1}^r v_j v_j^\top)$$ = \sum_{i = 1}^n \sum_{j = 1}^r \lambda_i v_i v_i^\top v_j v_j^\top = \sum_{i = 1}^r \lambda_i v_iv_i^\top = M$. This shows that $v_i, …, v_r$ form an orthonormal basis for the rowspace of $M$, so $\tilde{U}\tilde{\Sigma}\tilde{V}^\top = M$.

The SVD writes the matrix $M$ as the product of an orthogonal matrix, a diagonal matrix, and another orthogonal matrix. $U$ forms a basis for the column space, $V$ forms a basis for the row space.

** Theorem **
Given $M \in \Re^{m \times n}$ with SVD $M = U \Sigma V^\top = \sum_{i = 1}^n \sigma_i u_i v_i^\top$, the matrix $\hat{M}$ with $\mathrm{rank}(\hat{M}) \le r$ which minimizes $||M - \hat{M}||_F^2$ is given by $\hat{M}_{SVD} = \sum_{i = 1}^r \sigma_i u_i v_i^\top$ with error $||M - \hat{M}_{SVD}||_F^2 = \sum_{i = r+1}^n \sigma_i^2$.

** Proof **
First, note that that $||M||_F^2 = \sum_{ij}M_{ij}^2 = \sum_i(\sum_j M_{ij}^2) =$$ \sum_i ||e_i^\top M||_2^2 = \sum_j(\sum_i M_{ij}^2) = \sum_j ||M e_j||_2^2$. The L2 norm is orthogonal invariant, so for all orthonormal $Q, R$ it is always true $||RMQ||_F^2 = ||M||_F^2$ (the Frobenius norm is rotationally invariant). Then $||M||_F^2 = ||U^\top M V||_F^2 = ||U^\top U \Sigma V^\top V||_F^2 = ||\Sigma||_F^2 = \sum_i \sigma_i^2$ so $||M - \hat{M}_{SVD}||_F^2 = \sum_{i = r + 1}^r \sigma_i^2$. If $X$ has $\mathrm{rank}(X) \le r$, we can write $X = U_XY$ where $U_X \in \Re^{m \times r}, U_X^\top U_X = I, Y \in \Re^{r \times n}$ so we can minimize by $\min_{U_X}\min_Y ||M - U_XY||_F^2$. Setting $\Phi(Y) = ||M - U_XY||_F^2 = \langle M - U_X Y, M - U_X Y \rangle$, then $\nabla_Y \Phi(Y) = 0$ will give us the optimal $Y$. We have $\nabla \Phi(y)$ is a $r \times n$ matrix with $\nabla \Phi(y) = \frac{\partial \Phi}{\partial y_i} = -2 U_X^\top \langle M - U_X Y\rangle$. Setting equal to zero we have the minimum solution $U_X^\top M = U_X^\top U_X Y$ giving stationary point $Y_\star = U_X^\top M$. Plugging in gives $\min_{\mathrm{rank}(M) = r} ||M - \hat{M}||_F^2 = \min_{U_X} ||M - U_X U_X^\top M||_F^2$. Note that for any vector $v$ and any matrix $U_X$ with orthogonal columns, $||v||_2^2 = ||U_X U_X^\top v + (I - U_X U_X^\top) v ||_2^2 = ||U_X U_X^\top v||_2^2 + ||(I - U_X U_X^\top)v||_2^2 + 2\langle U_X U_X^\top v, (I - U_X U_X^\top)v \rangle$. Note that because $U_X^\top U_X = I$ we have $U_X^\top U_X U_X^\top = U_X^\top \rightarrow 0 = U_X^\top - U_X^\top U_X U_X^\top = U_X^\top(I - U_X U_X^\top)$ so $\langle U_X U_X^\top v, (I-U_X U_X^\top) v \rangle = v^\top U_X U_X^\top (I-U_X U_X^\top) v = 0$. From above we now have $||v||_2^2 = ||U_X U_X^\top v||_2^2 + ||(I - U_X U_X^\top)v||_2^2$. Since $||M||_F^2 = \sum_i ||M e_i||_2^2$ we have $||M||_F^2 = ||U_X U_X^\top M||_F^2 + ||(I-U_X U_X^\top) M ||_F^2 \rightarrow ||M||_F^2 - ||U_X U_X^\top M||_F^2 = ||M-U_X U_X^\top M||_F^2$. So our minimization from above becomes $\min_{U_X} ||M||_F^2 ||U_XU_X^\top M||_F^2 = ||M||_F^@ - \max_{U_X} ||U_X U_X^\top M||_F^2$. Now, via rotational invariance we have $||U_XU_X^\top M||_F^2 = ||U_XU_X^\top U\Sigma V^\top||_F^2 = ||U U_XU_X^\top U\Sigma||_F^2 = ||\tilde{U_X}\tilde{U_X^\top} \Sigma||_F^2$. Since $\tilde{U_X}$ is in the domain of optimization we can just overload and rewrite our maximization as $\max_{U_X} ||U_X U_X^\top \Sigma||_F^2$. Using the column L2 norm form, we can instead maximize $\max_{U_X} \sum_{i = 1}^n \sigma_i^2 ||U_X U_X^\top e_i||_2^2$. Using rotational invariance, $||U_X U_X^\top e_i||_2^2 = ||U_X^\top U_X U_X^\top e_i||_2^2 = ||U_X^\top e_i||_2^2$. If we then write $v_i = ||U_X^\top e_i||_2^2$ we have $v_i \in [0, 1]$, $\sum_{i = 1} v_i = r$. So $\max_{U_X} \sum_{i = 1}^n \sigma_i^2||U_X U_X^\top e_i||_2^2 \le \max_v \sum_{i=1}^n v_i \sigma_i^2$. The term on the right is maximized when $v_1 = v_2 = \ldots = v_r = 1$ and $v_{r+1} = \ldots = v_n = 0$.

The SVD can be computed efficiently, and there are many implementations written. One method for computing the SVD is the power method which takes advantage of the fact that the singular vectors come from the eigenvectors of $A = M^\top M$ or $MM^\top $. $A$ is symmetric with orthonormal basis of eigenvectors $A = V \Lambda V^\top$ which has real, non-negative eigenvalues. Taking $x_0 \in \Re^m$ we can compute $x_1 = \frac{Ax_0}{||Ax_0||_2}, x_2 = \frac{Ax_1}{||Ax_1||_2}, \ldots x_k = \frac{A^k x_0}{||A^k x_0||_2}$. If the eigenvaleus of $A$ are all different ($\lambda_1 > \lambda_2 > \ldots > \lambda_n \ge 0$) with $v_1$ normalized corresponding to $\lambda_1$ and $x_0^\top v_1 \ne 0$ then $x_k \rightarrow \{v_1, -v_1 \}$ as $k \rightarrow \infty$. Noting that $A^k = V \Lambda^k V^\top$ (raising $A$ to the $k$ just raises the eigenvalues to the $k$) we have $A^k x_0 = V \Lambda^k V^\top x_0 = V \Lambda^K [v_1^\top x_0; v_2^\top x_0; \ldots v_k^\top x_0]$. The last term can be written as $[v_1^\top x_0; v_2^\top x_0; \ldots v_k^\top x_0] = \frac {\Lambda^k [v_1^\top x_0; v_2^\top x_0; \ldots v_k^\top x_0]}{||\Lambda^k [v_1^\top x_0; v_2^\top x_0; \ldots v_k^\top x_0]||_2} \rightarrow \pm e_1$ (denominator is normalizing). If we pull out $\lambda_1^k$ from $\Lambda^K$ we see that $\Lambda_k$ converges to $\lambda_1^k$. Then we converge to $V(\pm e_1) = \pm v_1$.

In many cases, data will not lie in a linear space. However, we also want to try to find the simplest possible model, so we need to make some assumption on the space the data lives. So we seek more flexible “manifold” models for data (which may just be using the linear model after doing some appropriate manipulation to the data).

The term “manifold” is used loosely to mean some kind of low-dimensional set, like a sphere or a circle in some high-dimensional space. Manifold is a technical term from differential geometry which is hard to define in a straightforward way. On a manifold, a very small neighborhood looks like a low-dimensional space (flat). To define formally, we need to define what it means to “look like”: - depending on how you define this, you get different types of manifolds. A Riemannian manifold, for example, is one where we can measure lengths of vectors which are tangent to it. Knowing the more formal definition of a manifold is important so that we can make assumptions about the data. We care about the distance between data points in the set of possible data points.

Given $x_1, \ldots x_n \in \Re^D$ where $x_j$ lives near some subset $M \subset \Re^D $, define a distance $d_M(x, x^\prime) = \mathrm{inf}_{\gamma(t) \in M} \mathrm{Length}(\gamma)$ where $\gamma(t)$ joins $x, x^\prime$. We'd like to find a low dimensional mapping $\phi$ which maps $M$ to $\Re^d, d = \mathrm{dim}(M)$ such that $||\phi(x_i) - \phi(x_j)|| \approx d_M (x_i, x_j)$ or find some examples $y_1, \ldots y_n \in \Re^d$ such that $||y_i - y_j||_2 = d_M(x_i, x_j)$. We can say that $M$ is isometric to a (subset of) $\Re^d$ if it's actually possible that we can embed $M$ into this $d$-dimensional space and preserve these distances. If the points are pretty close together in the space, then the norm of (distance between) the points in the space is small and it is probably close to the distance along the manifold.

We can construct a matrix $D \in \Re^{n \times n}$ where $D_{ij} = d_M(x_i, x_j)$, a matrix of the distances between data points along the manifold. If we already know all of the distances along $M$, we can do the mapping by $D_{ij}^2 = ||y_i - y_j||_2^2 = ||y_1||_2^2 + ||y_j||_2^2 - 2\langle y_1, y_2 \rangle$ and $D_{ij}^2 = Q = \gamma\mathbf{1}^T - \mathbf{1}\gamma^T - 2y^Ty$ where $\gamma = [||y_1||_2^2; \ldots; ||y_n||_2^2]$ and $\mathbf{1} = [1; 1; \ldots 1]$. Defining $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, we have $Hx = x - \frac{1}{n}\mathbf{1}\mathbf{1}^Tx = x - \mathbf{1}(\frac{1}{n}\sum_i x_i)$ such that multiplication by $H$ corresponds to subtracting the mean. Now if we multiply $Q$ on the left and right by $H$ we get $\frac{1}{2}H^TQH = -\frac{1}{2}H^T(\gamma \mathbf{1} + \mathbf{1} \gamma^T - 2y^Ty)H$. Multiplying $\mathbf{1}$ by $H$ gives zero so $-\frac{1}{2}H^T Q H = H^T y^T yH = (yH)^T(yH)$. Now $yH$ is a matrix whose columns are translated to have mean 0 which we can call $\tilde{y}$, so $(yH)^T(yH) - \tilde{y}^T\tilde{y}$. Getting the eigenvalue decomposition gives $(yH)^T(yH) - \tilde{y}^T\tilde{y} = V \Lambda V^T = (V\Lambda^{1/2})(V \Lambda^{1/2})^T$ so we can take $\tilde{y} = (V\Lambda^{1/2})^T$. We can then show that $D$ is a distance matrix for a collection of points in $\Re^d$ if and only if $\mathrm{rank}(-\frac{1}{2}H^TQH) \le d$. The columns of $\Lambda^{1/2}V^T$ are the embedding.

Given a set of points without distances, one possible assumption is that the distance along the points in the “ambient” space is close to the distance of the points along the manifold. Then we approximate $D_{i, j} = ||x_i - x_j||$ when $||x_i - x_j|| < \epsilon$ (the points are close together). This gives us a subset $\Omega$ of pairs for which we believe the distance measurement. We can then make a graph $G$ with vertices $1, \ldots, n $, edges $(i, j)$ for all pairs $(i, j) \in \Omega$ whose distances we believe, and weights $w_{i, j} = ||x_i - x_j||_d$. We can then estimate $D_{ij}$ for some $i, j$ not in $\Omega$ with the shortest path length in the graph $G$. This is called isomap - an isometry from the data into $\Re^d$.

We have to be able to assume that the data in question has some low-dimensional representation in order for this distance to make sense. In some cases you can count the actual number of dimensions which makes sense for the low-dimensional space (number of parameters for the transformations). This approach can fail when the manifold is curved such that points that are far away on the manifold are close in the space, or when there is a hole which is not sampled, or when the manifold is non-Euclidian. If none of these things happen, it can be proven that the embedding works. Certain types of motions give you Euclidian manifolds - pivoting discs, translating symmetric shapes… For two discs, or a jointed shape with multiple joints, the manifold is not euclidian. We'd like to minimize $\min_Q \mathrm{rank}(-\frac{1}{2}H^T QH)$ such that $Q_{ij}^2 = D_{ij}, (i, j) \in \Omega$ - the simplest model that fits the data. If we can solve an optimization problem of this type, we may be able to avoid some of these problems. We could also impose additional constraints on the solution.

Given some function $f : \Re^n \rightarrow \Re$ we'd like to minimize $f(x)$ over some domain $x \in D$. For example, $\min f(x) : g_1(x) \le 0, \ldots g_n(x) \le 0$ - the inequalities force us to live in the sub-level set of the constraints $g$. Say $x_\star$ is a solution if $x_\star \in D$ and $f(x_\star) \le f(x) \forall x \in D$. We'd like to efficiently compute $x_\star$, or just approximate it. We'd also like to know when the formulation fit the data/goal.

It's possible that the function $f$ won't have a solution - for example, when $f(x) = x^3$, we can always make $x$ more negative and find a smaller $f(x)$. So it may help to make the domain bounded. If the function is discontinuous, it is possible that there's no minimum - we may be able to pick a sequence of points which approach a minimum, but never achieve the minimum because of the discontinuous jump in the function. If the boundary is open (eg $0 < x < 1$) we can have the same problem - making the function closer and closer to the minimum without ever reaching it. If none of these warning signs happen, then we'd like to be able to know that the optimization problem has a solution.

Denote $B(x, r) = \{ y \;|\; ||y - x||_2 \le r \}$ as the ball of radius $r$. A domain $D \subset \Re^n$ is bounded if there exists a finite $r > 0$ such that for all $x \in D$, $||x|| \le r$, or equivalently $D \subseteq B(0, r)$. A domain $S \subset \Re^n$ is open if for every $x \in S$, there exists $r > 0$ such that $B(x, r) \subseteq S$. Here, $r$ can depend on $x$. A set $S \subset \Re^n$ is closed if its complement $S^C = \{ x \;|\; x \not\in S\}$ is open. A function $f : \Re^n \rightarrow \Re$ is continuous if for every sequence $(x_k)$ converging to $x$, $f(x_k) \rightarrow f(x)$.

** Theorem **
Suppose $D \in \Re^n$ is closed and bounded, and $f$ is continuous. Then $f$ has a minimum in $D$.

** Proof **
A sequence of scalars $t_1, t_2, \ldots$ has a monotone subsequence $t_{k_1}, t_{k_2}, \ldots$ with $k_1 < k_2 < \ldots$ (nondecreasing or nonincreasing). A “cap” $i$ is an index where $t_i \ge t_{i^\prime} \forall i^\prime > i$, and the sequence of caps $k_1, k_2, \ldots$ is monotone nonincreasing. If there are infinitely many caps, we are done. If there isn't, then there's a final cap $k_1$, and we can set $k_2$ to the index of the next greater value, and $k_3$ to the next, etc. A bounded monotone sequence converges if $t_{k_1} \ge t_{k_2} \ge \ldots$ and $t_{k_i} \ge B$ (they are all bounded) then $\exists t_\star : t_{k_i} \rightarrow t_\star$. A bounded sequence of vectors has a convergent subsequence by applying the arguments for scalars to the coordinates.

Given $A \in \Re^{n \times n}$, maximize $x^T Ax$ such that $x^T x \le 1$ with $D = \{x \;|\; ||x||_2 \le 1\}$. This has the same solution as $-\min -x^TAx$, so we can find a solution via Weierstrass' theorem. If $A$ is symmetric and positive semidefinite (all eigenvalues are greater than 0), then the solution to this optimization is the largest eigenvalue of $A$.

Minimize $f(x)$ subject to $g_1(x) \le 0, \ldots g_n(x) \le 0, h_1(x) = 0, \ldots h_m(x) = 0$ with $f$ continuous and $g_i, h_j$ are continuous then as long as $D = \{x | g_i(x) \le 0, h_j(x) = 0\}$ is bounded, then we have a minimizer. The fact that $g_i, h_j$ are continuous implies that the sub level set $\{x \;|\; g_i(x) \le 0\}$ is closed.

We have a matrix $A$, a vector $b$ and we want to solve $\min f(x) \mathrm{\;s.t.\;} Ax = b$. We constrain the solution space to be an affine subspace of the form $x_0 + N$ where $N$ is the nullspace of $A$ and $x_0$ is some particular solution. If $||x|| \rightarrow \infty$ implies that $f(x) \rightarrow \infty$ (domain is unbounded) we can still find a solution given that the function continues to grow by just constraining the bounds.

Functions can have “unpleasant” shape - they can have local minima, or have many equivalent global minima (like a sine wave). A naive scheme can then get “trapped” in a local minima. The domain can also be ugly such that a global minimum is not reachable from some starting point by following the function's contour and staying in the domain. In some applications, it is not important to find the global optimum, it's just good to find a local optimum which is good enough for the problem's criteria. Convex problems have a good shape of function and domain, which allow them to be very feasibly computable.

A set is convex if we take a point and make local decisions, we will stay in the set. In other words, draw a line from one point to another (linearly interpolate) in the set and the line will lie entirely in the set's domain. Formally, $D$ is convex if $\forall x, y \in D, \forall t \in [0, 1], tx + (1 - t)y \in D$. There may be other structures which can still give a solution, but this is one of the most general constraints. For a nonconvex set, you can create a convex set which contains it and use that as the domain and hope that there is not a new minimal point introduced by growing the set.

- A linear subspace $L$ is convex because if $x, y \in L, \alpha, \beta \in \Re, \alpha x + \beta y \in L$.
- An affine subspace $A = \{x + y \;|\; x \in L\}$ is also convex.
- An affine subspace is stable under affine combinations $\lambda_1 x_1 + \ldots + \lambda_n x_n \;|\; \sum_i \lambda_i = 1$.
- The norm ball $D = \{x \;|\; ||x|| \le 1 \}$ (looks like a ball for Euclidian norm) is convex: If we take $x, y \in D, t \in [0, 1]$ then $||tx + (1 - t)y|| \le ||tx|| + ||(1 - t)y|| = t||x|| + (1-t)||y|| \le t + 1-t = 1$, so it still lies in the ball.
- If we take some vector $a \in \Re^n \ne \vec{0}, b \in \Re$, then the half-space $D = \{x \;|\; a^\top x \le b \}$ is convex: If $x, y \in D$ then $a^\top x \le b, a^\top y \le b$ so $a^\top(tx + (1 - t)y) = ta^\top x + (1-t)a^\top y \le tb + (1 - t)b = b$.
- If we take $x_1, x_2, \ldots x_p$ and we take all the points obtainable by interpolating between these points, $P = \mathrm{conv}(x_1, \ldots, x_p) = \{ \lambda_1 x_1 + \ldots \lambda_p x_p \;|\; \lambda_i \ge 0 \forall i, \sum_i \lambda_i = 1 \}$, this convex polytope is convex because we have explicitly included all linear combinations of the points.
- If $x, y \in P, \exists \lambda_1, \ldots, \lambda_p \;|\; \sum_i \lambda_i x_i = x, \gamma_1, \ldots, \gamma_p \;|\; \sum_i \gamma_i x_i = y$ then $tx + (1 - t)y = \sum_i (t\lambda_1 + (1 - t)\gamma_i) x_i$. If we combine points in this way $\lambda_1x_1 + \ldots + \lambda_p x_p; \lambda_i \ge 0, \sum_i \lambda_i = 1$ we call it a convex combination.

Say we know that the solution must satisfy $x \in D_1, x \in D_2$. To enforce these constraints simultaneously, we'd like $x \in D_1 \cap D_2$. An intersection between convex sets are convex because if $x \in D_1 \cap D_2$ then $x \in D_1, x \in D_2$, so if $x, y \in D_1 \cap D_2$ then $tx + (1 - t)y \in D_1, tx + (1 - t)y \in D_2$ so $tx + (1 - t)y \in D_1 \cap D_2$.

A linear map $x \rightarrow Ax + b$ also preserves convexity. If $A \in \Re^{m \times n}, b \in \Re^m, f(x) = Ax + b$, then if $C$ is a convex set and $f(C) = \{f(x) \;|\; x \in C\}$ then $f(C)$ is also convex. This can also be shown based on the definition of convex sets.

A perspective transformation takes a set $C$ and projects it onto a lower dimensional space by including all of the points where lines from the set to the origin intersect a plane. Formally, let $f(x, t) = \frac{1}{t} x : \Re^{n + 1} \rightarrow \Re^n, t > 0$ (we divide by the last coordinate). Given $C \in \Re^{n + 1}$ convex with $t > 0 \forall (x, t) \in C$, if we set $f(C) : \{f(x, t) \;|\; (x, t) \in C \} \in \Re^n$ then $f(C)$ is convex.

If we take a linear fractional transformation defined by $f(x) = \frac{Ax + b}{c^\top x + d}, x \in \Re^n, A \in \Re^{m \times n}, b \in \Re^m, c \in \Re^n, d \in \Re$ (a linear transformation divided by a linear function of $x$). If $C \in \Re^n$ and we set $f(C) = \{f(x) \;|\; x \in C\}$ then $f(C)$ is convex. This is just an affine transformation combined with a perspective transformation.

A function $f : \Re^n \rightarrow \Re$ is convex if its domain (allowable input values) is a convex set and for some $x, y$ in its domain and $0 \le \theta \le 1$ we have $f(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta) f(y)$ (Jensen's inequality). This means that a line segment between the function at $x$ and $y$ lies above the function. Concave is the same definition for $-f$. We can define an extended value extension of $f$ by setting $f(x) = \infty$ when $x$ is outside of the domain of $f$.

If $f$ is differentiable, then $f$ is convex if and only if $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$. This implies that the tangent line to a convex function at any point (its first order Taylor approximation) lies under the function (is a global underestimator).

**Theorem**
If $\nabla f(x) = 0$ then $x$ is a global minimizer of $f$.

**Proof**
For $f : \Re \rightarrow \Re$, first assume $f$ is convex so that for all $0 < t \le 1$, $x + t(y - x)$ is in the domain of $f$ when $x, y$ are in the domain of $f$. Since $f$ is convex we have $f(x + t(y - x)) \le (1 - t)f(x) + tf(y) \rightarrow f(y) \ge f(x) + \frac{f(x + t(y-x)) - f(x)}{t}$ which approaches $f(x) + f^\prime(x)(y - x)$ as $t\rightarrow \infty$. In the other direction, if we assume $f(y) \ge f(x) + f^\prime(x)(y - x)$, and we take $x \ne y$ and $0 \le \theta \le 1$ and let $z = \theta x + (1 - \theta)y$, then $f(x) \ge f(z) + f^\prime(z)(x - z)$ and $f(y) \ge f(z) + f^\prime(z)(y - z)$. Multiplying the first by $\theta$ and the second $1 - \theta$ and adding them yields $\theta f(x) + (1 - \theta)f(y) \ge f(z)$. We can then extend this result to $f : \Re^n \rightarrow \Re$ by restricting $f$ to the line passing through $x, y \in \Re^n$.

We would like to formally say that the derivative of a convex function is always increasing - it has upward curvature everywhere, and looks like a bowl. In one dimension, this is formulated as $\frac{\delta^2 f}{\delta x} \ge 0$.

**Theorem**
If $f$ is twice differentiable (its Hessian $\nabla^2f$ exists in its entire domain), then $f$ is convex if and only if its Hessian is positive semidefinite $\nabla^2 f \succeq 0$ (the function has “upward” curvature everywhere).

**Proof**
To prove, we can first show the first order characterization $\forall x, y, f(y) \ge f(x) + \nabla f(x)^\top (y - x)$. If we think about the segment joining $[x, y]$ we can find some point $z$ on that segment such that $f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2} (y - x)^\top \nabla^2 f(z)(y - x)$ (a result from calculus). Since $\nabla^2 f(z)(y - x)$ is nonnegative, we have $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$. In the other direction, we want to show that if $\exists x \mathrm{\;s.t.\;} \nabla^2 f(x) \not \succeq 0$ then $f$ is not convex. If $f(x) \not \succeq 0$ then $\exists z \mathrm{\;s.t.\;} z^\top \nabla^2 f(x)z < 0$. If we have $f(x + tz) = f(x) + \nabla f(x)^\top (x + tz - x) + \frac{1}{2}t^2 z^\top \nabla^2 f(x) z + \mathcal{O}(t^2)$, then for small $t$, we have $f(x + tz) < f(x) + \nabla f(x)^\top (z + tz - x)$ because $\frac{1}{2}t^2 z^\top \nabla^2 f(x) z$ was assumed to be negative. This breaks the first order condition.

Convexity (or concavity) can be verified either directly verifying the definition, verifying the Hessian is positively semidefinite, or restricting the function to an arbitrary line and verifying convexity of the resting function of one variable.

- Linear and affine functions are convex (and concave)
- $f(x) = x^\top Ax + b^\top x + c$ is convex if $A \succeq 0$
- Exponential functions $e^{ax}$ are convex for $a \in \Re$
- $x^a$ is convex on $\Re_{++}$ when $a \ge 1$ or $a \le 0$ and concave otherwise
- $|x|^p$ is convex if $p \ge 1$
- $\log(x)$ and $x \log(x)$ are concave on $\Re_{++}$ (note that $\log^{\prime\prime}(x) = -\frac{1}{x^2}$)
- Every norm on $\Re^n$ is convex because $||x + y|| \le ||x|| + ||y|| \rightarrow ||tx + (1 - t)y|| \le ||tx|| + ||(1 - t)y|| = t||x|| + (1 - t)||y||$.
- $\max\{x_1, x_2, \ldots x_n\}$ is convex
- The geometric mean is concave on $\Re_{++}^n$
- The arithmetic mean $\frac{1}{n} \sum_i^n x_i$ is both concave and convex
- $f(x, y) = \frac{x^2}{y}$ is convex for $y > 0$ (use the second order convexity condition)
- A log-sum of exponentials $\log(e^{x_1} + e^{x_2} + \ldots + e^{x_n})$ is convex

The $\alpha$-sublevel set of a function $f : \Re^n \rightarrow \Re$ is defined as $C_\alpha = \{x \in \mathrm{dom} f \;|\; f(x) \le \alpha\}$ - the values of $x$ in the domain of $f$ which produce an $f(x)$ which is less than or equal to $\alpha$. Sublevel sets of a convex function are convex for any $\alpha$. The super level is $\{x \in \mathrm{dom} f \;|\; f(x) \ge \alpha \}$.

The epigraph is the set of points above a function, given by $\mathrm{epi}(f) = \{(x, t) \;|\; x \in \mathrm{dom}(f), f(x) \le t \}$. A function is convex if and only if its epigraph is convex. The hypograph is the opposite (points below the function). The hyperplane tangent to a convex function supports its epigraph.

If $f$ is convex, $\theta_1, \ldots, \theta_k \ge 0, \sum_{i = 1}^k \theta_i = 1$ then $f(\sum_{i = 1}^k \theta_i x_i) \le \sum_{i = 1}^k \theta_i f(x_i)$. Jensen's inquality and convexity can produce many interesting inequalities.

- A nonnegative weighted sum of convex functions $f = w_1f_1 + \ldots w_m f_m$ is convex: $f(tx + (1 - t)y) = \sum_i w_i f_i(tx + (1 - t)y) \le \sum_i w_i (t f_i(x) + (1 - t)f_i(y)) = t\sum_i w_i f_i(x) + (1 - t)\sum_i w_i f_i(y)$
- If $f(x)$ is convex and we define $g(x) = f(Ax + b)$ with $\mathrm{dom}(g) = \{x \;|\; Ax + b \in \mathrm{dom}(f)\}$ then $g$ is convex: $g(tx + (1 - t)y) = f(A(tx + (1 - t)y)) + b) = f(t(Ax + b) + (1 - t)(Ay + b)) \le t f(Ax + b) + (1 - t)f(Ay + b) = tg(x) + (1 - t)g(y)$.
- An affine transformation of the domain.
- The pointwise maximum $f(x) = \max\{ f_1(x), f_2(x), \ldots f_n(x) \}$ of convex functions $f_1, f_2, \ldots f_n$ is convex: $f(tx + (1 - t)y) = \max_i f_i (tx + (1 - t)y) = f_{i^\star} (tx + (1 - t)y) \le$$ tf_{i^\star} (x) + (1 - t)f_{i^\star} (y) \le tf(x) + (1 - t)f(y)$ (for some $i^\star$ which maximizes).
- If $f(x, y)$ is convex in $x$ for $y \in \mathcal{A}$, then the point wise supremum $\mathrm{sup}_{y \in \mathcal{A}} f(x, y)$ is convex in $x$. This is the intersection of epigraphs over $y \in \mathcal{A}$. A function's convexity can be shown by expressing it as the point wise supremum of a family of affine functions - for example, the point wise supremum of the set of all affine global underestimators of a function. Given $h : \Re^k \rightarrow \Re$ and $g : \Re^n \rightarrow \Re^k$, define $f(x) = h(g(x))$ with $\mathrm{dom}(f) = \{ x \in \mathrm{dom}(g) \;|\; g(x) \in \mathrm{dom}(h) \}$. Assuming $\mathrm{dom}(g) = \Re$ and $\mathrm{dom}(h) = \Re^k$ and $g$ and $h$ are twice differentiable, the convexity can be shown from $f^{\prime\prime}(x) = g^\prime(x)^\top \nabla^2 h(g(x))g^\prime(x) + \nabla h(g(x))^\top g^{\prime \prime}(x)$.
- If $f$ is convex in $(x, y)$ and $C$ is a convex nonempty set, then $g(x) = \mathrm{inf} f(x, y)$ is convex in $f$ provided that $g(x) > -\infty$ for some $x$ with $\mathrm{dom}(g) = \{ x \;|\; (x, y) \in \mathrm{dom}(f), y \in C \}$.
- If $f : \Re^n \rightarrow \Re$, then $g : \Re^{n + 1} \rightarrow \Re$ defined as $g(x, t) = tf(x/t)$ with domain $\mathrm{dom}(g) = \{(x, t) \;|\; x/t \in \mathrm{dom}(f), t > 0 \}$ preserves convexity.
- Let $S_+(n)$ be the symmetric positive semidefinite matrices of size $n \times n$, and let $f : S_+(n) \rightarrow \Re$ via $f(A) = \lambda_{\max} (A)$, then $f$ is convex because $\lambda_{\max} (A) = \mathrm{sup}_{||x||_2 = 1} x^\top Ax$ (true for symmetric $A$)
- Given convex $f(x, y)$ set $g(x) = \min_{y \in C} f(x, y)$ with $C$ convex, then $g(x)$ is convex. Taking $x, z$, imagine there exist minimizers $y_x$ for $f(x, y)$, $y_z$ for $f(z, y)$ then $g(tx + (1 - t)z) \le f(tx + (1 - t)z, t_yx + (1 - t)y_z) \le$$ tf(x, y_x) + (1 - t)f(z, y_z) = tg(x) + (1 - t)g(z)$.

Given $f: \Re^n \rightarrow \Re$, the conjugate function $f^\ast : \Re^n \rightarrow \Re$ is defined as $f^\ast(y) = \mathrm{sup}_{x \in \mathrm{dom}(f)} (y^\top x - f(x))$. It is the point $x$ at which $f(x)$ is maximally small compared to the hyperplane $y^\top x$. It will occur where $f^\prime(x) = y$ if $f(x)$ is differentiable, or in other words, $f^\ast(y)$ can be interpreted as the y intercept of the tangent line of $f$ that has slope $y$. More specifically, if $f$ is convex and differentiable with $\mathrm{dom}(f) = \Re^n$, and if $y = \nabla f(x^\ast)$ where $x^\ast$ is a maximizer of $y^\top x - f(x)$ then $f^\ast(y) = x^{\ast \top} \nabla f(x^\ast) - f(x^\ast)$. It is a convex function (the point wise supremum of a family of convex functions of $y$). Fenchel's inequality states $f(x) + f^\ast(y) \ge x^\top y$ for all $x, y$. If $f$ is convex and its epigraph is a closed set, then $f^{\ast\ast} = f$. The conjugate sum of convex functions of different variables is the sum of their conjugates.

Say we are given a set of points $x_1, \ldots, x_n \in \Re^d$ each with an associated label $y_i \in \pm 1$, and we want to find some function $g : \Re^d \rightarrow \{\pm 1\}$ such that given some new input we can predict with some accuracy what the label would be. The minimal assumption we would make is that $(x_i, y_i) \sim_{iid} P(x, y)$ - they are independent samples drawn from the same joint distribution over a random vector $x$ and a binary random variable $y$. We want to either learn the distribution from the data or directly try to find a function that gives a good classification behavior. Given the training examples, a common technique is to try to draw a decision boundary which separates those labeled $+1$ and those labeled $-1$. Then we can decide what set a new input lies in, we can look at which side of the separator it falls on.

A simple model uses a linear separator - find a hyperplane $H = \{x \;|\; a^\top x + b = 0 \}$ that separates the training examples. Ideally, $\forall i : y_i = 1, a^\top x_i + b > 0$ and $\forall i : y_i = -1, a^\top x_i + b < 0$ or equivalently $\forall i : y_i(a^\top x_i + b) > 0$. Typically, if there exists such a hyperplane, it can be moved or rotated somewhat and still separate the data appropriately - so there could be no solutions or many solutions. So, we might like to find a hyperplane $H$ which maximizes the distance from each point to the hyperplane. If we restrict $||a||_2 = 1$, then $a^\top x + b$ is just the orthogonal distance to the hyperplane $H$. We can then define the margin as $M = \min_i y_i(a^\top x_i + b)$. Then our problem is $\max_{a, b, M} M$ such that $\forall i, y(a^\top x_i + b) \ge M, ||a||_2 = 1$. This could be turned into a minimization problem by minimizing over $-M$.

The optimization is convex, but the constraint is not. The constraint that $M - y_i(a^\top x_i + b) \le 0$ is a linear function of our unknowns $a, b, M$ which forces each of them to live in a convex subset. The constraint that $||a||_2 = 1$ is not a convex constraint because it is an equality, not an inequality. If we let $a^\prime = a/M, b^\prime = b/M$ and note that $M = \frac{1}{||a^\prime||_2}$ then our problem becomes $\max_{a^\prime, b^\prime} \frac{1}{||a^\prime||_2}$ such that $\forall i, y_i(a^{\prime\top} x_i + b^\prime) \ge 1$. Inverting the optimality constraint we have $\min_{a^\prime, b^\prime} ||a^\prime||_2$ such that $\forall i, y_i(a^{\prime\top} x_i + b^\prime) \ge 1$ which is now a convex optimization problem.

The strong assumption is that $\exists H$ that separates $x_i$ with label $1$ from the $x_i$ with label $-1$. In practice, it's not uncommon for the data to be non-separable. We can relax the problem to $\min_{a^\prime, b^\prime, \epsilon_i} ||a^\prime||_2$ such that $\forall i, y_i (a^{\prime \top} x_i + b^\prime) \ge 1 + \epsilon_i, \sum_i \epsilon_i \le k, \epsilon_i \ge 0$. For $0 < \epsilon < 1$, we are allowing the data point $x_i$ to live in the margin. For $\epsilon_i > 1$, the point $x_i$ is on the wrong side of the hyperplane. This is also a convex problem.

In logistic regression, instead of making a linear separator between the points $x_i$ by their labels $y_i$, we look at $\log \left( \frac{P(Y = 1 \;|\; X = x)}{P(Y = -1 \;|\; X = x)} \right) = a^\top x + b$. If we exponentiate both sides, $P(y = 1, \;|\; X = x) = \frac{e^{a^\top x + b}}{1 + e^{a^\top x + b}}, P(y = -1, | X = x) = \frac{1}{1 + e^{a^\top x + b}}$. If we then take the log, we have $\log(P_{ab}(Y = 1, \;|\; X = x)) = a^\top x + b - \log(1 + e^{a^\top x + b}),$ $\log(P_{ab}(Y = -1, \;|\; X = x)) = - \log(1 + e^{a^\top x + b})$. One way of fitting $a, b$ would be to maximize the likelihood of the data $x_i, y_i$, defined by $l(a, b) = \log (P_{ab} (Y_1 = y_1, \ldots Y_n = y_n \;|\; X_1 = x_1, \ldots X_n = x_n) )$$ = \log \left( \prod _{i = 1}^n P_{ab}(Y_i = y_i \;|\; X_i = x_i) \right)$$ = \sum_{i = 1}^n \log \left( P_{ab} (Y_i = y_i \;|\; X_i = x_i) \right) = \sum_{i = 1}^n \frac{y_i + 1}{2} (a^\top x_i + b) - \log(1 + e^{a^\top x_i + b})$. The maximum likelihood estimate is then $\max_{a, b} l(a, b) = \min_{ab} -l(a, b) =$$ \min_{a, b} \sum_{i = 1}^n -\left( \frac{y_i + 1}{2} \right)(a^\top a_i + b) + \log( 1 + e^{a^\top x_i + b} )$. This leads to a similar picture as in linear SVM, but with difference principals. This problem is the sum of of a series of linear functions $\left( \frac{y_i + 1}{2} \right)(a^\top a_i + b)$ and a series of functions of the form $\log( 1 + e^{a^\top x_i + b} )$. Using the rule that if $f(t)$ is convex in a scalar $t$, $f(a^\top x_i + b)$ is convex in $a, b$, we need to show that $f(t) = \log(1 + e^\top )$ is convex. Taking derivatives, we have $f^\prime (t) = \frac{e^\top }{1 + e^\top }$ and $f^{\prime\prime}(t) = \frac{e^\top }{1 + e^\top } + e^\top (-(1 + e^\top )^{-2})e^\top = \frac{e^\top }{1 + e^\top } - \left( \frac{e^\top }{1 + e^\top } \right)^2 > 0$ since $\frac{e^\top }{1 + e^\top } \in (0, 1)$. So this problem is also convex.

Say that we are given a big linear system of equations which is maybe only approximately satisfied $y \approx Ax$ with $A \in \Re^{m \times n}, y \in \Re^m, x \in \Re^n$. If $m = n$ and $A$ is nonsingular, we can solve the system for $x$ by inversion $x = A^{-1}y$. In some applications, we instead have $m \ll n$. One situation where this arises is in the high dimensional embedding problem, where we have a set of points in a high dimensional space and we want to project them to a lower dimensional space while preserving distances. We can interpret this problem as trying to infer some missing entries in a big data matrix $X$. $A$ is then the linear map which takes $X$ to a subset of its entries $y$. Our goal is given $A[X] = y$, we want to find $X$. In sampling and reconstruction, we may have a few discrete samples from the continuous world and we want to reconstruct a signal at higher sampling rate. These problems require some assumption on the data, like it being low-rank or composed of a superposition of a few elements from a dictionary. These constraints lead to optimization problems of the form $\min_x \Psi(x)$ subject to $Ax = y$ where $\Psi(x)$ is a convex function which encourages $x$ to have the “right structure” according to these constraints.

If our convex minimization problem has constraints like $\min f(x) \;|\; x \in D \subseteq \Re^m$ where $D$ is a convex set, we can show that $x_\star \in D$ is optimal if and only if $\forall y \in D, f(x_\star)^\top (y - y_\star) \ge 0$. If we let $f, g$ be convex functions and consider the problem of $\min f(x) \mathrm{\;s.t.\;} g(x) \le 0$ then the domain $g(x) \le 0$ is convex. The graph is a bowl shaped function constrained to a convex domain $D$. A point $x_\star$ is optimal either when it's somewhere in the middle of the domain or on the edge. If $x_\star \in \mathrm{int}(D)$ then $g(x_\star)$ is in the middle of the domain and it must satisfy $\nabla f(x_\star) = 0$. If $x_\star \in \mathrm{bdy}(D)$ then $g(x_\star) = 0$. For $x_\star$ to be an optimizer it would be sufficient that $\nabla f = 0$, but it can also be that the function is just decreasing at the boundary. Looking at the subset $\{g(x) = 0\}$, if we can find some $\lambda \ge 0$ such that $\nabla f(x_\star) = -\lambda \nabla g(x_\star)$ then $x_\star$ is a minimizer. This requires that $\nabla f(x)$ points “into” the domain, which means that as you enter the domain the function increases. These are the two possibilities for finding a global minimum.

If we have $\min f(x) \mathrm{\;s.t.\;} g_1(x) \le 0, g_2(x) \le 0$ with $g_1, g_2$ convex, then we are restricting our domain to the intersection of the two domains. The intersection is still a convex set. At the boundary of the intersection of the domains, either we are at the boundary of one function or the other or we are at a “corner”, where we must have $\nabla f(x_\star) = -\lambda_1 \nabla g_1(x_\star) - \lambda_2 \nabla g_2(x_\star)$ with $\lambda_1, \lambda_2 \ge 0$. Note that we can rearrange this constraint to $\nabla f(x_\star) + \lambda_1 \nabla g_1(x_\star) + \lambda_2 \nabla g_2(x_\star) = 0$. It looks like we are taking a gradient of some function, call it $L(x, \lambda) = f(x) + \lambda_1 g_1(x) + \lambda_2 g_2(x)$ where $L : \Re^m \times \Re^2 \rightarrow \Re$. Then our requirement for a minimizer is that $\nabla_x L(x, \lambda) = 0, g_i(x) = 0$ or $\lambda_i = 0$.

Motivates the Lagrangian…

Consider $\min f(x) \mathrm{\;s.t.\;} g_1(x) \le 0, \ldots g_n(x) \le 0, h_1(x) = 0, \ldots h_p(x) = 0$ where $f, g_i$ are convex and $h_j$ are affine ($h_j(x) = a_j^\top x + b_j, a_j \in \Re^m, b_j \in \Re$). Then we can define $L(x, \lambda, \nu) = f(x) + \sum_{i = 1}^n \lambda_i g_i(x) + \sum_{j = 1}^p \nu_j h_j(x)$ where $x \in \Re^m, \lambda \in \Re^n, \nu \in \Re^p$. This is called the Lagrangian. We call $x$ the “primal variable”, and $\lambda, \nu$ are called “dual variables” or “Lagrange multipliers”. If we imagine that $\lambda_i \ge 0$ and ignore the equality constraints $h_j$, we can think of $L(x, \lambda, \nu)$ as a softened or penalized counterpart to the original constraint problem. If $x$ is a feasible for the original constrained optimization problem (it lies within the domain) and $\lambda_i \ge 0$, then we have that $L(x, \lambda, \nu) \le f(x)$ (it always is a lower bound) because $g(x) \le 0$ so we are adding a negative amount to $f(x)$ and all of the $h_j(x)$ terms are zero (it's feasible).

If we make a function $I(f) = \{0, t \le 0; \infty, t > 0\}$, then our original problem is equivalent to minimizing $f(x) + \sum_{i = 1}^n I(g_i(x)) + \sum_{j = 1}^p I(h_j(x)) + I(-h_j(x))$. The Lagrangian is then basically a linear underestimator of this problem. Then we also have $\min_{x \in \Re^m} L(x, \lambda, \nu) \le \min_{x \in D} L(x, \lambda, \nu) \le \min_{x \in D} f(x) = p^\star$, the “primal optimal value”, giving $d(\lambda, \nu) = \min_{x \in \Re^m} L(x, \lambda, \nu) \le p^\star$, the “dual function”. Then our inequality becomes $\forall \lambda \ge 0, \nu, d(\lambda, \nu) \le p^\star$ (note - $\lambda \ge 0$ indicates that $\forall i, \lambda_i \ge 0$). Combining everything, we have $d(\lambda, \nu) = \min_x L(x, \lambda, \nu) = \min_x f(x) + \sum_{i = 1}^n \lambda_i g_i(x) + \sum_{j = 1}^p v_j h_j(x)$. You can think of it as a minimization of all of the constraint functions over $x$. Since we are taking the minimum of a bunch of affine functions, $d(\lambda, \nu)$ is a concave function which we can maximize. If we denote $d^\star = \max_{\lambda \ge 0, \nu} d(\lambda, \nu) \le p^\star$, then this is a dual problem, which is always a concave maximization problem whose optimal solution is always less than or equal to the optimization of the problem we started out with - it's a weak lower bound. This property is sometimes called “weak duality”. In some cases we can have strong duality, $d^\star = p^\star$.

Given the problem $\min_x x^\top x \mathrm{\;s.t.\;} Ax = b$, we have $f(x) = x^\top x$, $A \in \Re^{p \times m}$ and $h_j = a_j x - b_j$, so our Lagrangian is $L(x, \nu) = x^\top x + \nu^T(Ax - b)$. The dual is then $d(\nu) = \min_x L(x, \nu)$. We can minimize by setting $\nabla_x L(x, \nu) = 2x + A^\top \nu = 0$. We then have $x_\star = -\frac{1}{2}A^\top \nu$ and $L(x_\star, \nu) = \frac{1}{4} \nu^\top A A^\top \nu + \nu^\top (A (-\frac{1}{2}A^\top \nu) - b)$ and $d(\nu) = -\frac{1}{4} \nu^\top A A^\top \nu - \nu^\top b \le \min \{x^\top x \mathrm{\;s.t.\;} Ax = b\}$.

If we think about the set of values of $g_i$, $h_j$ we can generate, we can create a set $G \subseteq \Re^m \times \Re^p \times \Re$ defined by $G = \{ (u, v, t) \;|\; u = (g_1(x), \ldots g_n(x)); v = (h_1(x) \ldots h_p(x)), t = f(x)\}$ for some $x \in \Re^m$. The way we get the primal optimal value is to look over all the feasible points (the points in this set $G$) for the one which minimizes the objective function. Looking at the dual function $d(\lambda, \nu) = \min_x f(x) + \sum_i \lambda_ig_i(x) + \sum_j \nu_j h_j(x)$ we see that the dual is a minimization $\{ \langle (\lambda, \nu, 1), (u, v, t) \rangle \;|\; (u, v, t) \in G \}$ using the value $\lambda, \nu, 1$. We can think of this minimization as choosing $\lambda, \nu$ and making a vector $(\lambda, \nu, 1)$ and intersecting $G$ with all hyperplanes with normal $(\lambda, \nu, 1)$. The values of the dual problem are the points where the hyperplanes intersect the $f(x)$ axis. The dual problem is looking over all choices of the normal vector and looking at the largest intercept with the $f(x)$ axis. Weak duality implies that this point lies below (or at) the optimal value for the Lagrangian.

The hyperplanes will always be oriented with normal vectors pointing towards the first quadrant so we will have $d^\star = p^\star$ when the hyperplane is parallel to the $u$ axis. When $f$ is convex, $g_i$ are convex and $h_j$ are affine, we can define a set $A = \{(u, v, t) \;|\; u_i \ge g_i(x), v_j = h_j(x), t \ge f(x)\}$ for some $x \in \Re^m$ which includes all of the points “to the right” and “above” the set $g$.

** Theorem **
$A$ is convex.

** Proof **
Looking at $(u_1, v_1, t_1)$ and $(u_2, v_2, t_2)$, we need to show that $\exists x_1 \mathrm{\;s.t.\;} u_{1i} \ge g_i(x_2) \forall i, t_1 \le f(x_1), v_{1j} = h_j(x_1) \forall j$ and $\exists x_2 \mathrm{\;s.t.\;} u_{2i} \ge g_i(x_2) \forall i, t_2 \le f(x_2), v_{2j} = h_j(x_2) \forall j$. If we take $\gamma \in [0, 1]$ we need that $f(\gamma x_1 + (1 - \gamma)x_2) \le \gamma f(x_1) + (1 - \gamma)f(x_2) = \gamma t_1 + (1 - \gamma)t_2$ and $g_i (\gamma x_1 + (1 - \gamma)x_2) \le \gamma g_i (x_1) + (1 - \gamma) g_i (x_2) = \gamma u_{1i} + (1 - \gamma)u_{2i}$ and $h_j(\gamma x_1 + (1 - \gamma)x_2) = \gamma h_j (x_1) + (1 - \gamma h_j (x_2) = \gamma v_{1j} + (1 - \gamma) v_{2j}$. Now, $p^\star = \min \{t \;|\; (0, 0, t) \in A\}$.

We can construct a convex set $G$ which has a boundary and discontinuity (without the lower point included) along the $f(x)$ axis. For this problem, $d^\star \neq p^\star$ because $d^\star$ is the lower (non included) point of the discontinuity and $p^\star$ is the upper (included) point. Motivated by this issue, we can formulate the requirement that if $\exists \tilde{x} \mathrm{\;s.t.\;} g_i(\tilde{x}) < 0 \forall i$ (a strictly feasible point) and the original problem is convex, then strong duality holds. This is a constraint qualification condition called Slater's condition. More formally, consider the problem $\min f(x) \mathrm{\;s.t.\;} g_1(x) \le 0, \ldots, g_n(x) \le 0, Ax = b$ with $f, g_i$ convex, then if $\exists \tilde{x} \mathrm{\;s.t.\;} g_i(\tilde{x}) < 0 \forall i$, strong duality holds ($p^\star = d^\star$). Given $A = \{(u, v, t) \;|\; u \ge \vec{g}(x), v = \vec{h}(x), t \ge f(x) \}$ and $B = \{(0, 0, s) \;|\; s < p^\star \}$, $B$ is also convex.

If we don't assume convexity on our optimality conditions, but do assume that $d^\star = p^\star$, and consider $x^\star, (\lambda^\star, \nu^\star)$ are primal optimal and dual optimal respectively, we have $f(x^\star) = d(\lambda^\star, \nu^\star) = \min_x L(x, \lambda^\star, \nu^\star ) \le L(x^\star, \lambda^\star, \nu^\star) = f(x^\star) + \sum_i \lambda_i^\star g(x^\star) + \sum_j \nu^\star_j h_j(x^\star)$. Because $\lambda^\star \ge 0$, $g_i(x^\star) \le 0$ we have $f(x^\star) + \sum_i \lambda_i^\star g_i(x^\star) + \sum_j \nu^\star_j h_j(x^\star) \le f(x^\star)$ so $x^\star$ minimizes $L(x, \lambda^\star, \nu^\star)$ with respect to $x$. Further, we have $\sum_i \lambda_i^\star g_i(x^\star) = 0$ and therefor $\lambda_i^\star g_i(x^\star) = 0 \forall i$. So if $g_i(x^\star) < 0, \lambda_i^\star = 0$ or if $\lambda_i^\star > 0, g_i(x^\star) = 0$. This property is called complementarity or complementary slackness. So we have $g_i(x^\star) \le 0 \;\forall i, h_j(x^\star) = 0 \;\forall j, \lambda_i^\star \ge 0 \;\forall i, \lambda_i^\star g_i(x^\star) = 0 \;\forall i$ (primal feasibility, primal feasibility, dual feasibility, and complementary slackness). We also have $\nabla_x L(x^\star, \lambda^\star, \nu^\star) = 0 \rightarrow \nabla f(x^\star) + \sum_i \lambda_i^\star \nabla g_i(x^\star) + \sum_j v_j^\star \nabla h_j(x^\star) = 0$ because $x^\star$ minimizes $L$. All of these conditions are satisfied if strong duality holds and $f, g_i, h_j$ are differentiable and $x^\star$ is primal optimal and $\lambda^\star, \nu^\star$ are dual optimal. This set of conditions is called the Karush Kuhn Tucker or KKT conditions.

Suppose that $f$ is convex, $g_i$ is convex $\forall i$ and $h_j$ is affine $\forall j$ and that $\bar{x}, \bar{\lambda}, \bar{\nu}$ satisfies the KKT conditions. Then we have $f(\bar{x}) \ge d(\bar{\lambda}, \bar{\nu}) = \min_x L(x, \bar{\lambda}, \bar{\nu})$. Since we supposed that the problem is convex and $\bar{\lambda} \ge 0$, we have that $L(x, \lambda, \nu)$ is convex since it's the sum of convex functions so it's stationary at $\bar{x}$ so $\bar{x}$ minimizes $L$ over $x$. We also have that $L(\bar{x}, \bar{\lambda}, \bar{\nu}) = f(\bar{x}) + \sum_i \bar{\lambda_i}g_i(\bar{x}) + \sum_j \bar{\nu_j} h_j(\bar{x})$ so $f(\bar{x}) = d(\bar{\lambda}, \bar{\nu}) \le \max_{\lambda \ge 0} d(\lambda, \nu) = d^\star \le p^\star$ so $f(\bar{x} ) = p^\star$ so $\bar{x}$ is primal optimal and $\bar{\lambda}$ and $\bar{\nu}$ are dual optimal. So for a convex problem with strong duality (for example one that satisfies Slater's condition), the KKT conditions are necessary and sufficient for $x^\star, (\lambda^\star, \nu^\star)$ to be optimal for the primal and the dual. In addition, you “often” have the KKT conditions as a necessary condition for local minimality of $x^\star$ for general problems with differentiable $f, g_i, h_j$ (often means with additional conditions on $\nabla g_i, \nabla h_j$).

If we have a collection of vectors $x_1, \ldots x_n \in \Re^m$ with corresponding labels $y_1, \ldots y_n \in \{\pm 1\}$ we'd like to find a hyperplane which puts the points labeled $1$ on one side and $-1$ on the other side, $H = \{x \;|\; a^\top x = b \}$. To deal with the fact that it's OK for a few points to lie on the wrong side of the hyperplane, we add slack variables $\eta_i$ which reformulates the problem as $\min_{a, b, \eta} \frac{1}{2}||a||_2^2 \mathrm{\;s.t.\;} \forall i, \eta_i \ge 0, \forall i, y_i(x_i^\top - b) \ge 1 - \eta_i, \sum_i \eta_i \le k$. Most of the time people put the penalty of $\eta$ in the objective function giving $\min_{a, b, \eta} \frac{1}{2}||a||_2^2 + C\sum_i \eta_i \mathrm{\;s.t.\;} \forall i, \eta_i \ge 0, \forall i, y_i(x_i^\top a - b) \ge 1 - \eta_i$. This is our primal problem. We can formulate the Lagrangian as $L(a, b, \eta, \mu, \lambda)$ where $\mu \in \Re^n, \lambda \in \Re^n$. We can reformulate the constraints as $-\eta \le 0, -y_i x_i^\top a + y_i b - \eta_i + 1 \le 0$, giving $L(a, b, \eta, \mu, \lambda) =\frac{1}{2}||a||_2^2 + C\sum_i \eta_i + \sum_i \mu_i(-\eta_i) + \sum_i \lambda_i (-yx_i^\top a + y_i b - \eta_i + 1)$. We can obtain the dual by minimizing with respect to $(a, b, \eta)$, giving $\nabla_a L = a - \sum_i \lambda_i y_i x_i = 0$, $\nabla_b L = \sum_i \lambda_i y_i = 0$, $\nabla_{\eta_i} L = C - \mu_i - \lambda_i = 0$. Our three conditions are then that $a = \sum_i \lambda_i y_i x_i$, $\sum_i \lambda_i y_i = 0$, $\mu_i = C - \lambda_i$. Then we set $h = \frac{1}{2}||a||_2^2 - \sum_i \lambda_i \mu_i x_i^\top a + \sum_i \lambda_i$. Plugging in the stationarity conditions we have $h = \frac{1}{2}||\sum_i \lambda_i y_i x_i ||_2^2 - \sum_i \lambda_i y_ix_i^\top \sum_i \lambda_i y_i x_i + \sum_i \lambda_i$ so $L = -\frac{1}{2}\sum_i\sum_j \lambda_i \lambda_j y_i y_j x_i^\top x_j + \sum_i \lambda_i$. We can construct a matrix $Q = [y_1 x_1 \;|\; \ldots \;|\; y_n x_n], Q^\top Q = M$, giving $d(\mu, \lambda) = -\frac{1}{2} \lambda^\top M\lambda + \sum_i \lambda_i$ provided that the stationarity conditions hold.

Starting with the simplest optimization problem you might want to solve: $\min f(x) \mathrm{\;s.t.\;} x \in \Re^m$, assuming $f : \Re^m \rightarrow \Re$ is convex and “sufficiently nice” (twice continuously differentiable, $\nabla^2 f$ exists at every $x$ and is a continuous function of $x$). To solve this problem globally, it would be enough to find a point $x^\star$ where $\nabla f(x^\star) = 0$. If $f$ is a quadratic function $f(x) = x^\top Ax + b^\top x + c$ then we can solve the problem analytically. Often, we'll need some kind of iterative method which starts with some guess $x^{(0)}$ and then we'll generate a series of guesses $x^{(1)}, x^{(2)}, \ldots$. As the iteration becomes large $k \rightarrow \infty$, we hope that $f(x^{(k)}) \rightarrow p^\star$. As a prototype, given $x^{(0)}$, choose $\Delta x^{(k)}$ and $t^{(m)} \in (0, 1]$ and set $x^{(k + 1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$ until we converge. Generally we are not going to demand an exact solution, we'll just stop when $f(x^{(k)}) - p^\star \le \epsilon$.

A descent method has the property $\forall k, f(x^{(k + 1)}) < f(x^{(k)})$ - at each iteration we must “make progress”. Since $f$ is convex, we can also say $f(x^{(k + 1)}) \ge f(x^{(k)}) + \nabla f(x^{(k)})^\top (x^{(k + 1)} - x^{(k)}) < 0$ so $\nabla f(x^{(k)})^\top \Delta x^{(k)} < 0$. In general we need to specify how to set $t^{(k)}$ and the stopping condition. One choice would be to set $t^{(k)} = \mathrm{arg}\min_{t \in (0, 1]} f(x^{(k)} + t\Delta x^{(k)})$. This process is known as “line search”, were we approximately minimize $f$ over the line from $x^{(k)}$ to $x + \Delta x^{(k)}$.

In practice we usually use what's called an approximate line search which just looks for a value of $t$ which is good enough. A popular approach is backtracking line search, where you choose two parameters $\alpha \in (0, \frac{1}{2}), \beta \in (0, 1)$. Line searches address the fact that just making a local choice based on the function's derivative may not be the most effective approach. If we set $\bar{f}(t) = f(x^{(k)} + t\Delta x^{(k)})$ then we can approximate $\bar{f}(t) \approx f(x^{(k)}) + \nabla f(x^{(k)})^\top \Delta x^{(k)} t$. Instead of always moving based on the slope, we can make $\bar{f}_\alpha (t) = f(x^{(k)}) + \alpha \nabla f(x^{(k)})^\top \Delta x^{(k)}t$ so $f(x^{(k)} + t\Delta x^{(k)}) < \bar{f}_\alpha(t)$. We keep multiplying $t$ by $\beta$ until the stopping condition $f(x + t\Delta x) \le f(x) + \alpha t\nabla f(x)^\top \Delta x$ holds. The parameters $\alpha, \beta$ do not affect the computation time a lot.

Under strong convexity assumptions, $\nabla ^2 f(x) \ge mI, m > 0$ - the second derivative is bounded below. We can then take the usual inequality $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ to $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2}||y - x||_2^2$ because $\exists z \in [x, y] \mathrm{\;s.t.\;} f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2} (y - x)^\top \nabla^2f(z) (y - z)$. The last term $\frac{1}{2} (y - x)^\top \nabla^2f(z) (y - z) \ge \frac{m}{2} ||y - x||_2^2$.

We want to know how “good” our solution is, and strong convexity can be used to bound $f(x) - p^\star$, the suboptimality of a point $x$. Minimizing, we have $p^\star = \min_y f(y) \ge \min_y f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2} ||y - x||_2^2$. Differentiating the right hand side, $\nabla f(x) + m(y - x) = 0 \rightarrow y = x - \frac{1}{m} \nabla f(x)$. So $\min_y f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2} ||y - x||_2^2$ $= f(x) - \frac{1}{m} ||\nabla f(x)||_2^2 + \frac{1}{2m}||\nabla f(x)||_2^2 = f(x) - \frac{1}{2m} ||\nabla f(x)||_2^2$. Then we have $f(x) - p^\star \le \frac{1}{2m} ||\nabla f(x)||_2^2$ which suggests that we should stop when $||\nabla f(x)|| \le \tau$. If we want to control the distance between $x$ and the optimum $x - x^\star$ then (using Cauchy Schwartz) $f(x^\star) \ge f(x) + \nabla f(x)^\top (x^\star - x) + \frac{m}{2}||x^\star - x||_2^2$ $\ge f(x) - ||\nabla f(x)||_2 ||x^\star - x||_2 + \frac{m}{2}||x^\star - x||_2^2$ and $0 \ge -||\nabla f(x)||_2 ||x^\star - x||_2 + \frac{m}{2} ||x^\star - x||_2^2$ and rearranging $||x^\star - x||_2 \le \frac{2}{m} ||\nabla f(x)||_2$.

If we evaluate the inequality at the optimal $x^\star$, at any $x$ we have $f(x) \ge f(x^\star) + \nabla f(x^\star)^\top (x - x^\star) + \frac{m}{2} ||x^\star - x||_2^2 = f(x^\star) + \frac{m}{2}||x^\star - x||_2^2$. This implies that the level sets of $f$ are bounded (level sets are $S_\alpha = \{x \;|\; f(x) \le \alpha\}$). If $x \in S_\alpha$ then $\frac{m}{2} ||x - x^\star||_2^2 \le \alpha - p^\star$. If the algorithm is a descent method then every iterate $x^{(k)}$ lies in the set $S = \{x \;|\; f(x) \le f(x^{(k)})\}$. If we look at the largest eigenvalue of $\nabla^2 f(x)$ (a continuous function of $f$), under our assumptions it has a maximum $M$ where $\exists M \mathrm{\;s.t.\;} \forall k, MI \succeq\nabla^2 f(x^{(k)})$. This implies that $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{M}{2}||y - x||_2^2$. This indicates that there is some convex upper and lower bound which ensures that the function does not change too quickly. If we look at the level set $S_\alpha$, from strong convexity we have $S_\alpha$ has to be contained in a sphere centered around $x^\star$. This sphere will have size $\sqrt{\frac{\alpha - p^\star}{m}}$. It's also bounded inside by a sphere with radius $\sqrt{\frac{\alpha - p^\star}{M}}$. The ratio of the radii $\frac{m}{M}$ controls how eccentric $S_\alpha$ can be. Generally if $\frac{m}{M} \approx 1$ it will be “good”, if $\frac{m}{M} \approx 0$ this is “bad” because the level set indicates how fast the function curves. For example, when we set $\Delta x = -\nabla f(x^{(k)})$ (gradient descent) we will spend a lot of time oscillating back and forth as we try to approach the minimum because the gradient will only point a little bit in one direction towards the minimum and a lot in the other.

If we define$\bar{f}(t) = f(x - t\nabla f(x))$ ($f$ as a function of the step length in the negative gradient direction) and assume strong convexity, we have $\bar{f}(t) \le f(x) + \nabla f(x)^\top(-t\nabla f(x)) + \frac{Mt^2}{2}||\nabla f(x)||_2^2$ $= f(x) - t||\nabla f(x)||_2^2 + \frac{Mt^2}{2} ||\nabla f(x) ||_2^2$. Exact line search would choose a $t_{ex}$ which minimizes $\bar{f}(t)$, so that $\bar{f}(t_{ex}) \le \min_t f(x) - t||\nabla f(x)||_2^2 + \frac{Mt^2}{2} ||\nabla f(x)||_2^2$. We can instead minimize $g(t) = -t + \frac{Mt^2}{2}$; differentiating and setting equal to zero we have $g^\prime = 0 \rightarrow -1 + Mt = 0$ giving $t = \frac{1}{M}$. We then have $\min_t f(x) - t||\nabla f(x)||_2^2 + \frac{Mt^2}{2} ||\nabla f(x)||_2^2 = f(x^{(k)}) - \frac{1}{2M} ||\nabla f(x^{(k)}) ||_2^2$. The suboptimality of some point is then $f(x^{(k + 1)}) - p^\star \le f(x^{(k)}) - p^\star - \frac{1}{2M} ||\nabla f(x^{(k)})||_2^2$. The size of the gradient is an upper bound on the suboptimality at any point in time $||\nabla f(x)^{(k)})||_2^2 \ge 2m( t(x^{(k)}) - p^\star )$, $f(x^{(k + 1)}) - p^\star \le (f(x^{(k)}) - p^\star)(1 - \frac{m}{M})$. If $m$ is close to $M$ then the coefficient of contraction $(1 - \frac{m}{M})$ will be close to zero so we make progress quickly. By iterating this inequality, we have $f(x^{(k)}) - p^\star \le (1 - \frac{m}{M})^k (f^{(0)} - p^\star)$.

What should we do when $m/M$ is small? Instead of choosing the gradient as the direction, we should choose some other direction which is “turned inward” towards the center of the level set. In steepest descent, we set $\Delta x = \mathrm{arg}\min \{ v^\top \nabla f(x) \;|\; ||v|| \le 1\}$. If this norm is just the L2 norm, the solution is just $v = -\nabla f(x)/||\nabla f(x)||$. But we may be able to get something nice out if we use a different norm.

We can use the L1 norm for steepest descent $||v||_1 = \sum_i |v_i|$. The unit ball of the L1 norm looks like a diamond; we will minimize the gradient with the inner product over this set. This is picking out the largest coordinate of the gradient $i_\star = \mathrm{arg}\max |\partial f/\partial x_i| = -\mathrm{sgn}(\partial f/\sqrt{x_{i_\star}} )e_{i_\star}$. This is coordinate descent - we choose the single coordinate and move in the direction of that basis vector.

An ellipse is just a linear transformation of the circle, so we can make the level set more circular by using a norm which transforms to a space where the level set is circular. We can choose a norm $||x||_Q = \sqrt{x^\top Q x}, Q \succeq 0$ (note that $||x||_2 = ||x||_I$). Then our step is $\Delta x = \mathrm{arg}\min_v v^\top \nabla f(x) \mathrm{\;s.t.\;} v^\top Qv \le 1$. We can write down the Lagrangian $L(v, \lambda) = v^\top \nabla f + \lambda( v^\top Qv - 1)$. Differentiating, $\nabla_v L(v, \lambda) = 0 \rightarrow \nabla f(x) + 2\lambda Q v = 0 \rightarrow v = -\frac{\nabla f(x)}{2\lambda Q^{-1}}$. The scale factor $\Delta x$ should be $\frac{-Q^{-1}\nabla f(x)}{\sqrt{\nabla f^\top Q^{-1} \nabla f(x)}}$. Instead of moving along the direction of the negative gradient, we move along the direction of the negative gradient transformed by $Q^{-1}$. The steepest descent method in the quadratic norm $|| \cdot ||_Q$ can be thought of as the gradient method applied to the problem after the change of coordinates $\bar{x} = Q^{1/2}x$. We want to choose $Q$ to transform the level set into a sphere. In some cases (like a quadratic function $f$) we could solve explicitly for $Q$.

In general, we could choose our next point by looking locally at the location of our iterates. In particular, the gradient may not be enough information - we want to know how the gradient is changing locally, so we can look at the Hessian $\nabla^2 f$. If we set $Q = \nabla ^2 f(x)$, giving $\Delta = -(\nabla^2 f(x) )^{-1} \nabla f(x)$ (the Newton step). We can compute $\hat{f}(x + \Delta) = f(x) + \Delta^\top \nabla f(x) + \frac{1}{2} \Delta^\top \nabla^2 f(x) \Delta$. To minimize we must have $\nabla f(x) + \nabla^2 f(x) \Delta = 0 \rightarrow \Delta = -(\nabla ^2 f(x))^{-1} \nabla f(x)$. If it happens that $f$ is quadratic ($f(x) = x^\top Ax + b^\top x + c$), the usual gradient algorithm with this Newton step will terminate after one iteration. More generally, if $f$ is not quadratic then the iterative algorithm with $\Delta x = -(\nabla ^2 f)^{-1} \nabla f$ is affine invariant (if we choose invertible $L$ and set $y = Lx$, $g(y) = f(L^{-1}y) = f(x)$ then $Ly = L\Delta x$). So the algorithm will converge well for both the spherical and ellipsical level set cases.

Assume $\nabla^2 f \succeq mI$ and $\nabla^2 f(x)$ is Lipschitz with respect to $x$ (that is $||\nabla^2 f(x) - \nabla^2 f(y)|| \le L||x - y||$). Then $\exists \eta > 0, \gamma > 0, \eta < \frac{m^2}{L}$ such that either $||\nabla f(x)|| \ge \eta \rightarrow f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$ or $||\nabla f(x)|| \le \eta \rightarrow \frac{L}{2m^2}||\nabla f(x^{(k+1)}) || \le (\frac{L}{2m^2} ||\nabla f(x^{(k)})||)^2$. Since $\eta \le \frac{m^2}{L}$ we have $||\nabla f(x^{(k+1)}) \le \frac{m^2}{2L} ||\nabla f(x^{(k)})||_2^2 < \frac{1}{2}||\nabla f(x)^{(k)}||_2$. In other words, once the gradient gets small, it stays small. If we suppose $||\nabla f(x^{(\lambda)})|| \le \eta$ then $k > \lambda$, $\frac{L}{2m} ||\nabla f(x^{(k)})||_2 \le (\frac{L}{2m^2} ||\nabla f(x^{(\gamma)})||_2)^{2^{k-\gamma}}$. We can bound $f(x^{(k)}) - p^\star \le \frac{1}{2m} ||\nabla f(x^{(k)})||^2$. Once the second condition starts happening (the second order approximation is good), we are “almost there” - we will converge soon, we are in the quadratically convergent phase. Before that, we're in the damped Newton phase.

When $||\nabla f|| \le \eta$, $f(x + t\Delta x) \le f(x) + t\Delta^\top \nabla f(x) + \frac{Mt^2}{2}||\Delta x||_2^2$. We can define $\lambda(x)^2 = \nabla f(x)^\top (\nabla^2 f(x))^{-1} \nabla f(x)$. After manipulation, $\lambda(x)^2 = \Delta x^\top \nabla^2 f(x)\Delta x \ge m||\Delta x||_2^2$. We can then say $f(x + t\Delta x) \le f(x) - t\lambda^2(x) + \frac{Mt^2}{2m} \lambda^2(x)$. If we choose $\hat{t} = \frac{m}{M}$ then $\hat{t}$ satisfies the backtracking condition. So $f(x + t\Delta x) \le f(x) - \frac{m}{2M} \lambda^2(x) \le f(x) - \alpha \hat{t}\lambda^2(x)$. Our actual $t$ is at least $\beta \frac{m}{M}$. So we know $f(x + t\Delta x) \le f(x) - \alpha \beta \frac{m}{M} \lambda^2(x)$ and $\Delta x^\top \nabla^2 f(x) \Delta x \ge m||\Delta x||_2^2 = m ||(\nabla^2 f(x))^{-1} \nabla f||_2^2$. So finally $f(x + t\Delta x) \le f(x) - \beta \frac{m^2}{M^2} \eta^2$. For the other case, if $||\nabla f(x)|| \le \eta$ and $\eta \le 3(1 - 2\alpha)\frac{m^2}{L}$ then the line search always chooses $t = 1$. So $\nabla ||f(x^{(k+1)})||$ can be written as $||\nabla f(x^{(k)}) + \int_0^1 \nabla^2 f(x + t\Delta x) \Delta x \partial t||$. Using $\Delta x = -(\nabla^2 f(x))^{-1} \nabla f(x) \rightarrow \nabla f(x) = -\nabla^2 f(x) \Delta x$, we have $\nabla ||f(x^{(k+1)})|| = ||\int_0^1 \nabla^2 f(x + t \Delta x) - \nabla^2 f(x) \Delta x \partial t|| \le \int_0^1 t L ||\Delta x|| ||\Delta x|| \partial t$ $ = \frac{L}{2}||\Delta x||_2^2 = \frac{L}{2}||(\nabla^2 f)^{-1} \nabla f||_2^2 \le \frac{L}{2m^2} ||\nabla f(x)||_2^2$.

For very large steps, we can't compute the matrix inverse in Newton's method. A heavy ball method will use memory about previous steps to make better steps and avoid chatter, modifying the update to $x^{(k)} = x^{(k-1)} - t\nabla f(x^{(k-1)}) + s(x^{(k-1)} - x^{(k-2)}$.

Some problems are so large-scale that we can't even compute a gradient. Instead, we can compute some kind of a random vector $G(x, \xi)$ which we can use in place of the gradient. If we have strong convexity and the gradient estimates are bounded in the mean-squared sense, then this will still converge although at a slower rate. We can make progress by only looking at individual data samples rather than the entire dataset. We need to know the strong convexity parameter $t_k$, if you choose the wrong one it can work very badly.

We can make two changes to this algorithm. First, work with a different set of $t_k$ to make it converge faster by setting $t_k = \frac{\Theta}{\mu\sqrt{k}}$ where $\Theta$ is a constant we choose and $\mu$ is an upper bound on the size of the gradient estimate. We are taking slightly longer steps that go as $1/\sqrt{k}$. This will introduce more noise in our estimates $x_k$. To avoid this, we smooth the sequence of iterates by setting $\bar{x}_k = \frac{ \sum_{i = 1}^k t_i x_i }{ \sum_{i = 1}^k t_i }$ - we take a running average to hopefully get a better estimate. Then, if we look at the function value of the smoothed estimate we'll have $\mathbb{E}[ f(\bar{x_k}) - f(x^\star)] \rightarrow 0$ at a rate of about $\frac{ \log(k) }{ \sqrt{k} }$ which is even slower but allows us to make progress. If we consider iteration $i$, we showed that $t_i \mathbb{E}_{x_i} [ (x_i - x_\star)^\top \nabla f(x_i) ] \le \epsilon_i - \epsilon_{i + 1} + \frac{1}{2} t_i^2 + \mu^2$. We know that $f(x_\star) \ge f(x_i) + \nabla f(x_i)^\top (x_\star - x_i)$ so we have that $t_i \mathbb{E} [f(x_i) - f(x_\star)] \le \epsilon_i - \epsilon_{i + 1} + \frac{1}{2} t_i^2 \mu^2$. If we sum both sides, we have $\sum_{i = 1}^k t_i \mathbb{E}[f(x_i)] - \sum_{i = 1}^k t_i f(x_\star) \le \epsilon_i + \frac{1}{2}\mu^2 \sum_{i = 1}^k t_i^2$. Dividing, we have $\frac{ \sum_{i = 1}^k t_i \mathbb{E}[f(x_i)] }{\sum_{i = 1}^k t_i } - f(x_\star)$ $\le \frac{ \epsilon_i + \frac{1}{2}\mu^2 \sum_{i = 1}^k t_i^2 }{ \sum_{i = 1}^k t_i }$. The expectation is linear so we can pull it out. In the $k = 2$ case, we have $\mathbb{E} \left[ \frac{\sum_{i = 1}^2 t_i f(x_i)}{\sum_{i = 1}^2 t_i} \right] = \lambda f(x_1) + (1 - \lambda)f(x_2)$ where $\lambda \frac{ t_1 }{ t_1 + t_2 }$ and $f(\bar{x}) = f(\lambda x_1 + (1 - \lambda)x_2 )$. So using convexity, $\frac{ \sum_{i = 1}^k t_i \mathbb{E}[f(x_i)] }{\sum_{i = 1}^k t_i } - f(x_\star) \ge \mathbb{E} [ f(\bar{x}) ]$ so $\mathbb{E}[f(\bar{x}_k)] - f(x_\star) \le \frac{\epsilon_1 + \frac{1}{2}\mu^2 \sum_{i = 1}^k t_i^2} {\sum_{i = 1}^k t_i}$. We also have the approximations $\sum_{i = 1}^k t_i^2 = \sum_{i = 1}^k \frac{\Theta^2}{\mu^2 i} \le \frac{\Theta^2}{\mu^2} \log (k + 1)$ and $\sum_{i = 1}^k t_i = \frac{\Theta}{\mu} \sum_{i = 1}^k \frac{1}{\sqrt{i}} \ge \frac{\Theta}{\mu} \sqrt{k}$. So $\mathbb{E}[f(\bar{x}_k)] - f( x_\star ) \le \mu \left( \frac{ \frac{ \epsilon_1}{\Theta} + \frac{1}{2} \Theta \log( k + 1 )}{\sqrt{k}} \right)$ which goes as $\frac{ \log(k) }{ \sqrt{k} }$. These algorithms are very slow but give solutions to very large problems.

Say we'd like to solve $\min f(x)$ subject to $x \in C$ where $C$ is a convex set. Many of the algorithms we've seen so far actually carry over very naturally to this setting if it's possible to compute projections onto the set $C$. A projection is $||x - x^\prime||_2^2$ such that $x^\prime \in C$ - the unique closest point in $C$. If $C$ is a closed convex set then we can show that the closest point does exist and it does have to be unique. To make notation, $P_C (x) = \mathrm{arg}\min ||x - x^\prime||_2^2$ st $x^\prime \in C$. So, to use the algorithms described above, we just take a gradient step and then project back onto $C$: $x_{k + 1} = P_C( x_k - t_k \nabla f(x_k) )$. If $C$ is structured such that $P_C$ is cheap, we can just repeatedly do this projection. However, we need to solve a sequence of optimization problems in order to solve the initial optimization problem. Most of the algorithms and their convergence guarantees carry over if projection onto $C$ is easy.

Given the problem $\min f(x)$ such that $g_1(x) \le 0 \ldots g_m(x) \le 0, Ax = b$, it's not obvious how we would project back onto the constraint set. Recalling Newton's method, we set $\Delta x = -\nabla^2 f(x)^{-1} \nabla f(x)$ searching for the optimal solution which satisfies $\nabla f(x) = 0$, or $\nabla f(x + \Delta) = 0$ for some small $\Delta$. The Newton step is $\nabla f(x) + \nabla^2 f(x) \Delta = 0$. The KKT conditions require that $Ax = b, g_i(x) \le 0 \forall i$, $\lambda_i \ge 0 \forall i$, $\nabla f(x) + \sum_i \lambda_i \nabla g(x) + A^\top v = 0$, and $\forall i, \lambda_i g_i(x) = 0$. Some of the fastest converging algorithms for the general problem try to get a solution to these optimality conditions. If we take the conditions and ignore the inequality constraints, then we have a system of three equations in three unknowns $x, \lambda, \nu$, which we can write as the functions $r( x, \lambda, \nu)$. We'd like to find $x, \lambda, \nu$ such that $r( x, \lambda, \nu ) = 0$, which we can do by setting $\frac{\partial r(x, \lambda, \nu)}{ (x, \lambda, \nu) } [\Delta x, \Delta \lambda, \Delta \nu ] = -r( x, \lambda, \nu )$. Linearizing the optimality condition gives a big system of equations which depends on the Hessian of $f$ and the constraints.

Instead of trying to force $\lambda_i g_i(x) = 0$ at each step, we can enforce that $\lambda_i g_i(x) = -\frac{1}{t_k} \forall i \in \{1, \ldots m\}$, where $t_k$ is a relaxation parameter. If $\lambda_i$ and $g_i(x)$ are feasible then $\lambda_i \ge 0$, $g_i(x) \le 0$. If $t_k$ is big we're close to satisfying $\lambda_i g_i(x) = 0$. This gives a primal dual interior point method. This gives a good solution to a small problem very quickly.

If it's too complicated to solve this big system of equations, we can use duality and solve the dual instead.

Looking at a simplified version of our general optimization problem $\min f(x)$ st $Ax = b$ (you can easily add in inequality constraints). The Lagrangian is $L(x, \nu) = f(x) + \nu^\top (Ax - b)$. We define the dual function as $d( \nu) = \mathrm{inf}_x L(x, \nu)$. We can describe a conceptually simple iteration involving two steps (one lightweight or heavy, one light): $x_{k+1} = \mathrm{arg}\min_x L(x, \nu_k), \nu = \nu_k + t_k(Ax_{k+1} - b)$. In the first step, we minimize $L$, giving $L(x_{k+1}, \nu_k) = d(\nu_k)$. Then, we add back a multiple of the violation of the constraint (the error) because $\nu_k$ is a penalty for violating the constraint. More precisely, if the minimizer $x_{k+1}$ is unique, then $d(\cdot)$ is differentiable at $\nu_k$ and $\nabla d(\nu_k)$. This suggests gradient ascent on the dual problem - or dual ascent. In the course of computing a maximizer of the dual, we are actually generating primal points. A primal optimal solution is given by $x_{\star} = \mathrm{arg}\min L(x, \nu_k)$. The function $d(\nu)$ at a point $\nu_k$ has a best affine approximation which is a hyperplane $d(\nu_k) + \langle \nabla d(\nu_k), \nu - \nu_k \rangle$. This linearization is the only linearization at this point which lies above the graph ($d$ is convex). If we think about $d$ at some other value $\nu^\prime$, we have $d(\nu^\prime) = \min_x L(x, \nu^\prime) = \min_x f(x) + \nu^{\prime\top}(Ax - b) \le$ $f(x_{k+1}) + \nu^{\prime\top}(Ax_{k+1} - b) = d(\nu_k) + \langle \nu^\prime - \nu_k, Ax_{k+1} - b \rangle$. This shows that $Ax_{k+1} - b$ is always a subgradient, and if $d$ is a gradient then it gives the unique hyperplane above the graph.

Maximizing the dual could be easier to solve (eg with linear classification). If the objective function $f$ is separable $f(x) = \sum_{i = 1}^N f_i(x_i)$ where the $x_i$s make up $x$ and we write $A = [A_1 \;|\; \ldots \;|\; A_N]$ we have $L(x, \nu) = \sum_I f_i(x_i) + \nu^\top A_i x_i - \nu^\top b$. We can actually now solve the minimization problems in parallel. However, this could cause problems. For example, if $f(x) = q^\top x$, then $L(x, \nu) = q^\top x + \nu^\top (Ax - b)$. More generally, if $L(x, \nu)$ can be unbounded below in $x$ then we run into trouble because $\nu_k$ can become dual infeasible ($d(\nu_k) = -\infty$).

We started out by solving $\min f(x)$ st $Ax = b$; we could imagine we solve an equivalent problem where we minimize $f(x) + \frac{\rho}{2}||Ax - b||_2^2$ st $Ax = b$. This will change the algorithm but not the solution. The Lagrangian is now $L_\rho( x, \nu ) = f(x) + \langle \nu, Ax - b \rangle + \frac{\rho}{2}||Ax - b||_2^2$. The additional term causes us to not stray too far from the feasible set. We call this an “augmented Lagrangian”. In dual ascent, we now have $x_{k + 1} = \mathrm{arg}\min_x L_\rho( x, \nu_k )$ and $\nu_{k + 1} = \nu_k + \rho(Ax_{k+1} - b)$, a “method of multipliers”. Minimizing the first step gives $0 = \nabla_x L_\rho (x_{k+1}, \nu_k) = \nabla f(x_{k+1}) + A^\top \nu_k + \rho A^\top (Ax_{k+1} - b) =$ $\nabla f(x_{k+1}) + A^\top[\nu_k + \rho (Ax_{k+1} - b)] = \nabla f(x_{k+1}) + A^\top \nu_{k+1} = \nabla_x L(x_{k+1}, \nu_{k+1})$ so $x_{k+1}$ minimizes $L(x, \nu_{k+1})$ so $d(\nu_{k+1}) = L(x_{k+1}, \nu_{k+1}) > -\infty$. So, we no longer have the issue where we are unbounded below. This can also be interpreted as a better answer to a penalty method - solving $\min f(x)$ st $Ax = b$ using a sequence of problems $\min f(x) + \frac{\rho_k}{2} ||Ax - b||_2^2$ sending $\rho_k$ to $\infty$. This works provably, but the problems get harder to solve as $\rho_k$ gets bigger. The augmented Lagrangian avoids this.

If $f(x) = \sum_i f_i(x_i)$, then the updates for dual ascent separate. By adding the augmented Lagrangian term $||Ax - b||_2^2$ moving from $L$ to $L_\rho$, we lose the separability, but we can get it back for some special cases. If $f(x)$ separates over two terms, then our problem is $\min f(x) + g(z)$ st $Ax + Bz = b$, giving $L_\rho(x, z, \nu) = f(x) + g(z) + \nu^\top (Ax + Bz - b) + \frac{\rho}{2}||Ax + Bz - b||_2^2$ and $x_{k+1} = \mathrm{arg}\min_x L(x, Z_k, \nu_k)$, $z_{k+1} = \mathrm{arg}\min L_\rho(x_{k+1}, z, \nu_k)$, $\nu_{k+1} = \nu_k + p(Ax_{k+1} + Bz_{k+1} - b)$. An example is the consensus problem where we want to minimize $f(x) = \sum_{i=1}^N f_i(x)$. For example, $f_i$ could be a loss on a chunk of data. If we reformulate from $x \in \Re^d$ to $x_i \in \Re^d, i \in \{1, \ldots N\}$ and instead solve $\min \sum_{i = 1}^N f_i(x_i)$ st $x_1 - z = 0, x_2 - z = 0, \ldots x_N - z = 0$. The Lagrangian is then $L_\rho(x_i, \ldots x_N, z, \nu_i, \ldots \nu_N) = \sum_{i = 1}^N f(x_i) + \nu_i^\top (x_i - z) + \frac{\rho}{2}||x_i - z||_2^2$. To use ADMM, we use variables $\tilde{x} = (x_1 \;|\; \ldots \;|\; x_N)$ and $z$, so that $x_{k+1} = \mathrm{arg}\min_{x_i} f_i(x_i) + \nu_i^\top (x_i - z) + \frac{\rho}{2}||x_i - z||_2^2$ and $z_{k+1} = \frac{1}{N} \sum_{i = 1}^N (x_i + \frac{1}{\rho} \nu_i)$ and $\nu_{i, k+1} = \nu_i + \rho( x_{k+1} - z_{k+1} )$. Matrix decomposition problems (breaking $D = X_1 + X_2$) also fall into this framework by solving $\min f_1(x_1) + f_2(x_2)$ st $X_1 + X_2 = D$.

We have previously shown that the rank-$r$ approximation $\hat{X}$ constructed by truncating the SVD of $X$ is optimal in the frobenius norm sense - $||\hat{X} - X||_F$ is minimized. This makes sense when the data is “clean” - in addition to having good low dimensional structure, the data does not have some other non-idealities. These problems can be manifested as errors in the samples or errors in the data features (a few errors in each of the observations). If the error is sufficiently large, you can create any arbitrary change in the PCA/SVD approximation.

We can formulate this problem via optimization: $\min_S \sum_{i=1}^N ||x_i - P_s x_i||_2^2$, where $P_S$ is a projection matrix onto $S$ of dimension d. We can write $P_S \succeq 0$ and $P_S = U_S U_S^\top$, where $U_S$ is a matrix whose columns are an orthonormal basis for $S$. We can also say that its eigenvalues are $0$ or $1$, that is $\lambda_i(P_S) \in \{0, 1\}$, so that $\mathrm{tr}[P_S] = r$. Denoting $P = P_S$, we can minimize over $P$ with $\mathrm{rank}(P) = r, \mathrm{tr}[P] = r, \lambda_i(P) \in \{0, 1\}, P \succeq 0$. This is not a convex problem because $P$ has to live in the discrete set constrained by its eigenvalues. We can instead require $\lambda_i(P) \in [0, 1]$ or equivalently $0 \preceq P \preceq I$, which produces a convex problem. It's still trying to minimize squared error, which is problematic for large errors - a large error will be assigned a very large cost (L2 norm is quadratic). So, we'd like to reformulate it to better handle outlying points by making the objective $\min \sum_{i=1}^N ||x_i - Px_i||_2$ which is still convex. A projected gradient algorithm would require that we solve $\min ||P^\prime - P||_F^2 \mathrm{\;s.t.\;} 0 \preceq P^\prime \preceq I$. Using the eigenvalue decomposition $P = U\Lambda U^\top$, we have $||P^\prime - U \Lambda U^\top||_F^2$ so we could equivalently minimize $||P^{\prime\prime} - \Lambda||_F^2$ where $P^{\prime\prime} = U^\top P^\prime U$ and $0 \preceq P^{\prime\prime} \preceq I$. This gives us robustness to errors in samples.

The PCA or SVD problem is solving a rank $d$ approximation problem - $\min_{\mathrm{rank}(\hat{X}) = d} ||\hat{X} - X||_F^2$ or $\min \mathrm{rank}(\hat{X}) \mathrm{\;s.t.\;} ||\hat{X} - X||_F \le \epsilon$. The error in features is orthogonal to the underlying subspace of the data, so we can represent it as $X = X_0 + E_0$. We can then solve the problem $\min \mathrm{rank}(\hat{X}) \mathrm{\;s.t.\;} ||\hat{X} + E - X||F \le \epsilon$. Here $E$ is unconstrained so we could easily set it to $X$ and set $\hat{X} = 0$, so we need to put a penalty on the number of nonzero entries in the error $||E||_0 \le k$. This can be included in the objective function as $\min \mathrm{rank}(\hat{X}) + \gamma ||E||_0 \mathrm{\;s.t.\;} ||\hat{X} + E - X||_F \le \epsilon$. The $\mathrm{rank}$ and $||\cdot||_0$ operators are both nonconvex. However, we can at least make a bound on the objective function - we can compute a convex envelope, the biggest convex function which sits everywhere underneath the original function. For $||\cdot||_0$, this convex envelope is $||\cdot||_1$. For $\mathrm{rank}$, we can replace with $||X||_\ast = \sum_{i = 1}^n \sigma_i(X)$ where $\sigma_i$ are the singular values. This is called the Nuclear norm (it is a norm) which is convex but is not smooth. So, our objective has been reformulated as $||\hat{X}||_\ast + \gamma||E||_1 \mathrm{\;s.t.\;} ||\hat{X} + E - X||_F \le \epsilon$. We are approximating $\hat{X} + E = X$ while requiring that the approximation is low-rank, and that the error is sparse. If this structure is actually satisfied and the approximation is sufficiently low-rank with sufficiently sparse error, we can prove that the solution to the problem is actually correct. We can solve this problem with ADMM.

We can assume that our observed $X$ is actually our desired data plus some low-rank error and some sparse error, projected onto some space, as $P_\Omega[X] = P_\Omega[X_0 + E_0 + C_0]$, where $P_\Omega$ projects onto some lower dimensional space. We can try to fit the data to $P_\Omega[\tilde{X} + \tilde{E} + \tilde{C}]$ using ideas from above, by minimizing $\min ||\hat{X}||_\ast + \gamma_1||E||_1 + \gamma_2\sum_i ||\tilde{C}e_i||_2 \mathrm{\;s.t.\;} ||P_\Omega[\tilde{X} + \tilde{E} + \tilde{C}] - P_\Omega[X]||_F^2 \le \epsilon^2$. Hopefully, if everything has the right structure, this will give us the right answer.

Given data $y = [y_1 | \ldots | y_p] \in \Re^{m \times p}$, it may be that the data doesn't fall on one subspace - some points may fall on one subspace and others may fall on another. For example, patches of natural images tend to exhibit clusters/subspaces in a low-dimensional space based on the color and/or texture. Determining this structure can help in higher-level processing like segmentation. Formally, given subspaces $S_1, \ldots S_k \in \Re^m$ and sample matrices $y_1, \ldots, y_k$, $y_i \in \Re^{m \times n_i}$ with $\mathrm{range}(y_i) \subseteq S_i$, we seek to cluster/segment the subspaces of $Y = [y_1 \ldots y_k]$. There's no reason the eigenvalues/eigenvectors of the covariance would be helpful here - we'd rather encapsulate the algebraic structure of the data. If $U_i$ is a basis for $S_i$, we can write $y_i = U_i x_i$ for some coefficient vector $x_i$. We don't know the subspaces ahead of time, so we don't have a basis. We can instead try selecting a basis from the samples themselves - look for a representation $y_i = Yc_i$, with $c_{ii} = 0$ so that no point is used to describe itself. Note that $\mathrm{Null}(Y)$ is big because it's a big data matrix whose entries lie on a lower dimensional subspace. So, we should only need a few of the examples to reconstruct each of the inputs - we can pick a relevant basis for each $S_i$ by selecting only a few $y_i$. Formally, we can solve $\min_{c} ||c_i||_0 \mathrm{\;s.t.\;} y_i = Yc_i, c_{ii} = 0$. Convexifying gives $\min_{c_i} ||c_i||_1 \mathrm{\;s.t.\;} y_i = Yc_i, c_{ii} = 0, \forall i = 1, \ldots, p$. Denoting $C$ as the matrix of $c_i$ gives $\min_C ||C||_1 \mathrm{\;s.t.\;} Y = YC, \mathrm{diag}(C) = 0$. If the data is noisy, we may instead want to solve $\min_C ||C||_1 + \frac{\beta}{2} ||Y - YC||_F^2 \mathrm{\;s.t.\;} \mathrm{diag}(C) = 0$. This is a convex optimization problem - the constraint is linear.

In solving this problem, we need to be careful because $||\cdot||_1$ is not strongly convex and because of the constraint. So, we can exploit the fact that $\min ||C||_1 + \frac{\gamma}{2}||C - H||_F^2$ can be solved in closed form, giving $C_{ij} = \mathrm{sgn}(H_{ij})\max(|H_{ij}| - \gamma, 0)$ (this property of the L1 norm is not uncommon). We can use ADMM to put this problem in this form. We can decouple $C$ in the objective function by adding a variable and a constraint $\min_C ||C||_1 + \frac{\beta}{2} ||Y - YZ||_F^2 \mathrm{\;s.t.\;} \mathrm{diag}(C) = 0, C = Z$. The Lagrangian is then $L_\rho(C, Z, \Lambda) = ||C||_1 + \frac{\beta}{2}||Y - YZ||_F^2$ $ + \langle \Lambda, C-Z \rangle + \frac{\rho}{2}||C - Z||_F^2 + I(\mathrm{diag}(C)) = 0$. This last function $I$ is $\infty$ when its argument is false, and $0$ when its true - it just blows up when the constraint is violated to ensure that the constraint stays satisfied in any solution. ADMM is then minimizing $L_\rho$ with respect to $C$, then minimizing with respect to $Z$ each of which we can solve nearly in closed form.

Given this algorithm, we'd like to know how “good” it is. We should ask when $C$ is block diagonal - this indicates that each point is the sum of points within its subspace. The answer is “often”, if the data follows the model. We can evaluate this by first solving the main problem to produce $C_\star$, then solving it for each $y_1, \ldots, y_k$ and seeing whether the blocks of $C_\star$ correspond to $C_{\star i}$, the solutions of these smaller problems. This can be formulated as an algebraic condition - an “independence” assumption that $\mathrm{dim}(S_1 + \ldots + S_k) = \sum_{i + 1}^k \mathrm{dim}(S_i)$ where $S_i$ are independent subspaces - there are no linear dependencies between the subspaces. We can show this using duality. Given the problem $\min_C ||C||_1 \mathrm{\;s.t.\;} Y + YC, \mathrm{diag}(C) = 0$ we can instead solve $\min_{C, B} ||C||_1 \mathrm{\;s.t.\;} Y = YC, \mathrm{diag}(C) = 0, C_{ij} \le B_{ij}, -C_{ij} \le B_{ij}$. If we solve $\min_B B_{ij} \mathrm{\;s.t.\;} C_{ij} \le B_{ij}, -C_{ij} \le B_{ij}$ we will be finding the best $C_{ij}$. We can then modify our objective to $\min_{C, B} \langle 1, B \rangle \mathrm{\;s.t.\;} Y = YC, \mathrm{diag}(C) = 0, C_{ij} \le B_{ij}, -C_{ij} \le B_{ij}$. We are “hiding bad properties of the objective function (L1 norm) in the constraints”. The Lagrangian is then $L(C, B, \Gamma_1, \Gamma_2, \Lambda_1, \lambda) = \langle 1, B \rangle + \langle \Lambda, YC - Y \rangle$ $+ \langle \lambda, \mathrm{diag}(C) \rangle + \langle \Gamma_1, C - B \rangle + \langle \Gamma_2, C - B \rangle$. For KKT, we need $\Gamma_1, \Gamma_2 \ge 0$ and $0 = \nabla_B L = 1 - \Gamma_1 - \Gamma_2$ and $0 = \nabla_C L = y^\top \Lambda + \mathrm{diag}(\lambda) + \Gamma_1 - \Gamma_2$.

Given a bunch of examples $X = [x_1, \ldots, x_n] \in \Re^m$, we'd like to be able to solve a bunch of problems - clustering, nearest neighbor, etc. We know that some of the data points are similar and other sare dissimilar (a labeling) - two sets $S \subseteq X \times X$ (similar) and $D \subseteq X \times X$ (dissimilar). If we just measure distance as $||x - y||_2$, we might not get the best way to represent this similarity/dissimilarity. Instead, we'd like to learn a distance (metric) $d(x, y)$ that is consistent with our labeling information. Then, given new pairs of samples we can tell if they are similar or dissimilar by comparing the distance in the new metric. We need to constrain the possible distance functions.

One possibility is to work in some linear transformation of L2 - use some $A \succ 0$ to define $d_A(x, y) = \sqrt{(x - y)^\top A (x - y)}$. The unit ball of this norm will look like an ellipsoid whose axes are the eigenvectors of $A$ with axis lengths the eigenvalues of $A$. Now, we want to find the best $A$. If $x_i, x_j \in S$ we want $d_A(x_i, x_j)$ to be small; if $x_i, x_j \in D$ we want $d_A(x_i, x_j)$ to be large. This produces the optimization problem $\min_A \sum_{x_i, x_j \in S} d_a(x_i, x_j)^2 \mathrm{\;s.t.\;} A \succeq 0, \sum_{x_i, x_j \in D} d_A(x_i, x_j) \ge 1$. This minimization is convex because each term in the sum is just a linear function of $A$. The constraint term $\sum_{x_i, x_j \in D} d_A(x_i, x_j) \ge 1$ is concave; we can make it convex via $1 - \sum_{x_i, x_j \in D} d_A(x_i, x_j) \le 0$. If we instead use squared differences in this term, the constraint becomes linear in $A$ and the optimization tends to produce low-rank $A$. We can reformulate this optimization problem as $\min -\sum_{x_i, x_j \in D} d_A(x_i, x_j) \mathrm{\;s.t.\;} \langle M, A \rangle \le 1, A \succeq 0$ where $\langle M, A \rangle$ encapsulates the term $\sum_{x_i, x_j \in S} d_A(x_i, x_j)^2$. This problem can be solved using a projected gradient algorithm, where at each step we project onto the intersection of the sets specified by the constraints.

To show that we can swap the objective function and the constraint, we look at the general problem $\min f(x) \mathrm{\;s.t.\;} g(x) \le 0$ where $f, g$ are convex and differentiable. If strong duality holds, a solution $x_\star \mathrm{\;s.t.\;} g(x_\star) \le 0$ is optimal if and only if $\exists \lambda_\star \ge 0 \mathrm{\;s.t.\;} \lambda_\star g(x_\star) = 0$, giving $\nabla L(x_\star, \lambda_\star) = \nabla f(x_\star) + \lambda_\star \nabla g(x_\star)$. $\lambda_\star$ is the Lagrange multiplier for the optimal solution of the original problem. Defining an unconstrained version $\Phi(x) = f(x) + \lambda_\star g(x)$ we would like to show that this is minimized at the same optimal point. If it also happens that $\lambda_\star > 0$ which is necessary if $\nabla f(x_\star) \ne 0$ then $\nabla g(x_\star) + \lambda_\star^{-1} \nabla f(x_\star) = 0$. This is the Lagrangian for the problem $\min g(x) \mathrm{\;s.t.\;} f(x) \le \beta$ where $\beta = f(x_\star)$. We've shown that $x_\star$ is again optimal for this problem - we can switch between these formulations.

This trick doesn't always hold - consider the problem $\min_{x \in \Re^2} x_1 + x_2 \mathrm{\;s.t.\;} (x_1 + 1)^2 + x_2^2 \le 1, (x_1 - 2)^2 + x_2^2 \le 4$. Call the constraints $g_1, g_2$ respectively. The feasible set (which satisfies both constraints) is just the origin, so the optimum must be $x_\star = (0, 0)$. If we define $\Phi(x_1, x_2) = f(x_1, x_2) + \beta_1 g_1 + \beta_2 g_2$, the gradient is $\nabla \phi(x_1, x_2) = (1, 1) + \beta_1(2(x_1 + 2), 2x_2) + \beta_2(2(x_1 - 2), 2x_2)$. Plugging in $x_\star = (0, 0)$ gives $(0, 0) = (1, 1) + (2\beta_1 - 4\beta_2, 0)$. There's no setting of $\beta_1, \beta_2$ which satisfies this problem. This problem doesn't have strong duality - Slatir's condition fails. In the linear metric learning problem, we are justified in moving the constraint into the objective.

Focusing back on $\min -\sum_{x_i, x_j \in D} d_A(x_i, x_j) \mathrm{\;s.t.\;} \langle M, A \rangle \le 1, A \succeq 0$, we can solve using the projected gradient iteration step $A^{(k+1)} = P_{C_1 \cap C_2} [A^{(k)} - t^{(k)} \nabla f( A^{(k)} )]$ where $C_1, C_2$ are the constraint sets for $\langle M, A \rangle \le 1, A \succeq 0$ respectively. We can project onto $C_2$ using an eigenvalue decomposition. Projection onto $C_1$ can be formulated as $P_{C_1} [H] = \min_{A \in C_1} \frac{1}{2}||A - H||_F^2 = H - \frac{(\langle H, M \rangle - 1)}{||M||_F^2} + M$. We now need to be able to project onto the intersection, given a method for projecting onto either set. One way to do this would be to project onto $C_1$, then project onto $C_2$, and repeatedly projecting back and forth until we reach a point in both $C_1$ and $C_2$. The overall problem we'd like to solve is $\min_A \frac{1}{2}||A - H||_F^2 + I_{A \in C_1} + I_{A \in C_2}$ where $I$ are indicator functions. We can artificially minimize over an additional variable $B$, giving $\min_{A, B} \frac{1}{2}||A - H||_F^2 + I_{A \in C_1} + I_{B \in C_2} \mathrm{\;s.t.\;} A = B$. We then get the Lagrangian $L_p (A, B, \Lambda) = \frac{1}{2}||A - H||_F^2 + I_{A \in C_1} + I_{B \in C_2} + \langle \Lambda, A - B \rangle + \frac{\rho}{2}||A - B||_F^2$. We can then use ADMM to optimize the Lagrangian.

Given some data samples $y_1, \ldots, y_p \in \Re^D$ we imagine that they live near some subset with much lower dimensional structure. We'd like to find $x_1, \ldots x_p \in \Re^d, d \ll D$ for some $\Omega \in [p] \times [p]$ where $d_{ij} \approx ||y_i - y_j||_2, (i, j) \in \Omega$ such that $||x_i - x_j|| \approx d_{ij}$. We are trying to find a smaller subspace where distance between points is roughly conserved. In some cases, we might not know the specific locations of $y_1, \ldots, y_p$ but we do know the distances between them. This problem can be solved in various ways using a semidefinite program.

Looking at the embedded points, there is some curvature in the high dimensional space which we'd like to get rid of - minimize the curvature so that it is roughly a flat space. In the projected space we'd like everything to be pushed as far apart as possible, subject to the distance constraints. This gives rise to the problem $\max_{x_1, \ldots x_p} \sum_i ||x_i - x_j||_2^2 \mathrm{\;s.t.\;} \forall i, j \in \Omega, ||x_i - x_j||^2 = d_{ij}^2, \sum_{i=1}^p x_i = 0$. The objective function is concave and the first constraint is an equality of convex functions so it is nonconvex. To convexify, we look for a way to turn this into a semidefinite program. We can look instead at the gram matrix $Q = [x_1 | \ldots | x_p]^\top [x_1 | \ldots | x_p]$ which gives $||x_i - x_j||_2^2 = Q_{ii} + Q_{jj} + 2Q_{ij}$. This allows us to instead solve the problem $\min_{Q \succeq 0} \sum_{ij} [Q_{ii} + Q_{jj} - 2Q_{ij}] \mathrm{\;s.t.\;} \forall i, j \in \Omega, Q_{ii} + Q_{jj} - 2Q_{ij} = d_{ij}^2, \sum_{i = 1}^p = 0$. The last constraint can be made linear by $\sum_{ij} Q_{ij} = 0$. If we write $Q_{ii} + Q_{jj} - 2Q_{ij} = \langle M_{ij}, Q \rangle$ where $M_{ij} = [1, -1; -1, 1]$ and note that the first constraint interacts with the objective, we get $\min_{Q \succeq 0} -\mathrm{tr}[Q] \mathrm{\;s.t.\;} \forall i, j \in \Omega, \langle M_{ij}, Q \rangle = d_{ij}^2, \sum_{ij} Q_{ij} = 0$. With the optimal matrix $\hat{Q}$ for this problem, we can compute $\hat{Q} = U\Lambda U^\top = U\Lambda^{1/2}\Lambda U^\top$ which gives us $\hat{x} = \Lambda^{1/2}U^\top$. Minimizing the negative trace is unusual because $\mathrm{tr}[Q] = \sum_i \lambda_i (Q) = \sum_i \sigma_i(Q)$. To solve at large scale, you would write down the Lagrangian $L_p(Q, \nu_{ij})$. Often, it doesn't really matter what the objective function is because the constraints are what are telling us how to arrive at the solution. We can evaluate our ability to solve this problem using rigidity theory - we'd like to know when $d_{ij}$ determines $x_i$ up to rigid body motion - that is, $\forall i, j \in \Omega, \frac{d}{dt} ||x_i(t) - x_j(t)||_2^2 = 0$.

Given a graph, we'd like to find a way to separate the graph into subgraphs which are not connected to each other. The graph $G$ can be described by its $p$ vertices $V$, edges $E$ and weights $W$. We can define a degree matrix $D \in \Re^{p \times p}$ whose entries on the diagonal are the degrees (number of connected vertices) for each vertex. We can define the graph laplacian to be $L = D - W$. The Laplacian describes how smooth functions are on the graph. A function on the graph is $f: [p] \rightarrow \Re$ or $f \in \Re^p$, giving $f^\top L f = f^\top(D-W)f = \sum_{ij} f_i f_j(D_{ij} - W_{ij}) = \sum_{ij} f_i f_j D_{ij} - \sum_{ij} f_i f_j W_{ij}$. Since $D$ is diagonal we can say $\sum_{ij} f_i f_j D_{ij} - \sum_{ij} f_i f_j W_{ij} = \sum_{i=1}^p f_i^2 D_{ii} - \sum_{ij} f_i f_j W_{ij}$. We can split the first sum to give $\frac{1}{2} \sum_{i = 1}^p f_i^2 D_{ii} + \frac{1}{2}\sum_{j=1}^p f_j^2 D_{jj} - \sum_{ij} f_i f_j W_{ij}$. From our definition of $D$ we then can write $\frac{1}{2} \sum_{ij} f_i^2 W_{ij} + \frac{1}{2}\sum_{ij} f_j^2 W_{ij} - \sum_{ij} f_i f_j W_{ij} = \frac{1}{2} \sum_{ij} W_{ij}(f_i - f_j)^2$. In other words, $f^\top L f$ is a measure of “smoothness” of the function $f$. This gives suggests that $\mathbf{1} \in \mathrm{null}(L)$ and $L \succeq 0$. If we are given a graph $G = (V, E, W)$ with $k$ connected components (a connected component is a subset $A \subseteq V \mathrm{\;s.t.\;} \forall i, j \in A$ with a path between $i$ and $j$ which does not leave $A$ and no has no edge joining $i \in A$ and $j \in A^C$) then the size of $\mathrm{null}(L)$ is $k$.

In practice a much more common scenario is data with strong clustering but a few weak links between the clusters. We would still like to separate the data into the $k$ subspaces. We would like to find sets $A_1, \ldots A_k$ whose disjoint union is just the vertex set. We can say $W(A, B) = \sum_{i \in A, j \in B} w_{ij}$, which gives a measure of the amount of damage caused by the cut $\mathrm{Cut}(A_1, \ldots A_k) = \sum_{i = 1}^k W(A_i, A_i^C)$. If $k = 2$, we have efficient algorithms for minimizing $\mathrm{Cut}$ over the disjoint union of $A_1$ and $A_2$. In some cases this minimization can cut the “wrong” links - it won't necessarily cut the links between clusters we're interested in. We'd like to balance the fraction of edge weight we cut to the size of the clusters we're left with. So instead of minimizing the total number of edges we cut, we want to minimize $\mathrm{RatioCut}(A_1, \ldots, A_k) = \sum_{i = 1}^k \frac{W(A_i, A_i^C)}{|A_i|}$. Another popular alternative is the normalized cut, $\mathrm{NormalizedCut}(A_1, \ldots, A_k) = \sum_{i = 1}^k \frac{W(A_i, A_i^C)}{\mathrm{Vol}(A_i)}$ where $\mathrm{Vol}(A_i) = \sum_{j \in A_i} d_j$. Both of these formulations are NP-hard and hard to approximate.

What should we do when a problem is nonconvex? Can we make it convex somehow?

Given $A \in \Re^{m \times m}, b \in \Re^m$, we seek an $x$ such that $Ax \approx b$. We can solve $\min_x ||Ax - b||_2^2$, the least squares problem. We may have additional knowledge about $x$ - say $x \ge 0$. This gives an additional constraint to the least squares problem. Another additional constraint may be that the entries live in some small discrete set like $x_i \in \{ \pm 1 \}$, called binary or boolean least squares. This problem is NP-hard in general. The issue is that we have a quadratic function of $x$ and some discrete constraint on $x$, which is not uncommon. An even more general problem is a quadratic constrained quadratic programming problem. The constraint that $x \in \{ \pm 1 \}$ is equivalent to $x_i^2 = 1$, so this is a QCQP if we write it as $\min ||Ax - b||_2^2 \mathrm{\;s.t.\;} x_i^2 = 1$. We can expand out to $\min x^\top A^\top Ax - 2b^\top Ax + b^\top b \mathrm{\;s.t.\;} x_i^2 = 1$. We can write $x^\top A^\top Ax = \mathrm{tr}[x^\top A^\top Ax] = \mathrm{tr}[x x^\top A^\top A] = \langle xx^\top, A^\top A \rangle$. Note that $[x x^\top]_{ii} = x_i^2$. We can rewrite the original problem as $\min \langle A^\top A, x x^\top \rangle + \langle -2 A^\top b, x \rangle + b^\top b \mathrm{\;s.t.\;} \mathrm{diag}(x x^\top) = \vec{1}$. The objective function can be rewritten as a big matrix inner product $\langle Q, X \rangle$ by stacking entries as $Q = [A^\top A, -A^\top b; -b^\top A, b^\top b]$ and $X = [xx^\top, x; x^\top, 1]$. Then the problem can be written as $\min \langle Q, X \rangle \mathrm{\;s.t.\;} \mathrm{diag}(X) = \vec{1}, X \succeq 0, \mathrm{rank}(X) = 1$, a rank-constrained semidefinite program. Now the difficulty of the problem is encapsulated in the constraint $\mathrm{rank}(X) = 1$. One way to try to deal with this would be to just drop the rank constraint altogether - $\min \langle Q, X \rangle \mathrm{\;s.t.\;} \mathrm{diag}(X) = \vec{1}, X \succeq 0$. This gives a bigger feasible set which implies that we get an optimal objective value which is no larger than the one we had for the previous problem - it is a lower bound for the original problem. If $X^\star$ solves the relaxed problem and $\mathrm{rank}(X^\star) = 1$ then we haven't lost anything when relaxing the problem.

$\min x^\top Ax \mathrm{\;s.t.\;} x^\top Bx = 1$ where $A, B$ are symmetric. We can use quadratic lifting to instead solve $\min \langle A, X \rangle \mathrm{\;s.t.\;} \langle B, X \rangle \le 1, X = xx^\top$. If we let $W(A, B) = \{ ( \langle A, X \rangle, \langle B, X \rangle ) \;|\; X \succeq 0 \}$, then $W$ is a cone. Since $\{ X \;|\; X \succeq 0\} = \mathrm{conv}(\{x x^\top \;|\; x \in \Re^n \})$, we have $W(A, B) = \mathrm{conv}( \{ ( \langle A, xx^\top \rangle, \langle B, xx^\top \rangle ) \;|\; x \in \Re^n \}) = \mathrm{conv}( \{ (x^\top Ax, x^\top Bx) \;|\; x \in \Re^n \})$. So there always exists a rank-one solution. More generally, we can try to solve problems of the form $\min x^\top AX + b^\top x + c \mathrm{\;s.t.\;} x^\top A^\prime x + b^{\prime \top} x + c \le 0$, the “trust region problem”. The SDP relaxation is always exact.

Given a weighted undirected ($W = W^\top$) graph defined by its vertices, edges and weights $G = (V, E, W)$ we seek $S \in V = \{1, \ldots, n\}$ which maximizes $\mathrm{cut}(S) = \sum_{i \in S, j \in S^C} W_{ij}$. If we set $x_i = 1, i \in S; -1, i \in S^C$ then $1 - x_i x_j = 2, x_i \ne x_j; 0 \mathrm{\; otherwise}$. Then we have $\mathrm{cut}(x) = \frac{1}{4} \sum_{ij} (1 - x_i x_j) W_{ij} = \frac{1}{4}\left( \sum_i x_i^2 \sum_j W_{ij} - x^\top W x \right) = \frac{1}{4} \left( x^\top Dx - x^\top Wx \right) = \frac{1}{4}x^\top Lx$. So we can write the original problem as $\max_{i \in \{\pm 1\}^n} \frac{1}{4} x^\top Lx$. We know that $L \succeq 0$, and we want to maximize $x^\top Lx$ where $x \in \{\pm 1\}^n$.

This is a binary quadratic maximization, which we can relax as above to $\max \langle A, X \rangle \mathrm{\;s.t.\;} X \succeq 0, \mathrm{diag}(X) = \vec{1}$, a semidefinite program. If we suppose $X^\star$ solves this SDP, then write $X^\star = V^\top V$ where $X^\star = U \Lambda U^\top, V = \Lambda^{1/2}U^\top$. If $\mathrm{rank}(X^\star) = 1$ we're done. Otherwise set $\hat{x} = \mathrm{sgn}(V^\top z)$ where $z \sim_{iid} \mathcal{N}(0, 1)$. Then $\hat{x}$ is a binary vector which is a solution to the Binary Quadratic Maximization problem - but how good is it?

**Theorem**
$\mathbb{E}[\hat{x}^\top A\hat{x}] \ge \frac{2}{\pi} \max_{x \in \{\pm 1\}^n} x^\top Ax$.

**Proof**
Note $\mathbb{E}[\hat{x}^\top A \hat{x}] = \mathbb{E}[\langle A, \hat{x}\hat{x}^\top \rangle] = \langle A, \mathbb{E}[\hat{x}\hat{x}^\top] \rangle$. The $i, j$th entry of $\mathbb{E}[\hat{x}\hat{x}^\top]$ is $\mathbb[\mathrm{sgn}(v_i^\top z)\mathrm{sgn}(v_j^\top z)]$. Note that $\mathbb{E}[\mathrm{sgn}(v_i^\top z)\mathrm{sgn}(v_j^\top z)] = \frac{2}{\pi} \mathrm{arc}\cos (v_i^\top v_j)$ so $\mathbb{E}[\hat{x}\hat{x}^\top] = \frac{2}{\pi} \mathrm{arc}\cos(v_i^\top v_j)$. If $A \succeq 0, B \succeq 0$ then $\langle A, B \rangle \ge 0$ because if $\Gamma \succeq 0, e_i^\top \Gamma e_i \ge 0 \forall i$ so if we write $A = U \Lambda U^\top$ then $\langle A, B \rangle = \mathrm{tr}[AB] = \mathrm{tr}[U\Lambda U^\top B] = \mathrm{tr}[\Lambda U^\top B U] = \langle \Lambda, U^\top B U\rangle = \sum_i \lambda_i e_i^\top \Gamma e_i \ge 0$ where $\Gamma = U^\top B U$. Note also that if $A \succeq 0, B \succeq 0$ then $A \odot B \succeq 0$ because $A \odot B x = \mathrm{diag}(A \mathrm{diag}(x) B)$ so $x^\top A \odot B x = \langle x, \mathrm{diag}(A \mathrm{diag}(x) B) \rangle = \langle \mathrm{diag}(x), A \mathrm{diag}(x) B \rangle = \langle \mathrm{diag}(x) A \mathrm{diag}(x), B \rangle$ which is the inner product of two positive semidefinite matricies. Finally note that if $X \succeq 0$ and $|X_{ij}| \le 1$ then $\mathrm{arc}\sin(X) - X \succeq 0$ because the first order Taylor approximation to $\mathrm{arc}\sin(X)$ is $X$ plus a positive semidefinite term. Combining these facts, we have $\mathbb{E}[\hat{x}^\top A \hat{x}] = \mathbb{E}[ \langle A, \hat{x}\hat{x}^\top \rangle ] = \langle A, \mathbb{E} [\hat{x} \hat{x}^\top] \rangle = \frac{2}{\pi} \langle A, \mathrm{arc}\sin(X) \rangle = \frac{2}{\pi}(\langle A, X \rangle + \langle A, \mathrm{arc}\sin(X) - X \rangle ) \ge \frac{2}{\pi}\langle A, X \rangle \ge \frac{2}{\pi} \max_{x \in \{\pm 1\}^n} x^\top Ax$.

We can actually do better than this - a $.878$ bound is known with the same algorithm. We can use these ideas for other problems, eg by using the Taylor series of a constraint function instead in minimization.

Given a problem of the form $\min f(x)$, we can try to solve this problem with a surrogate $\hat{f}$ which will approximate $f$ but is better structured for optimization. One good choice is the convex envelope, $\hat{f}(x) = \mathrm{conv}_D(f)(x) = \mathrm{sup}\{ g(x) \;|\; g(x) \le f(x) \forall y \in D \}$. We consider all functions $g$ which lie below $f$ and choose the largest one. Since a supremum of convex functions is convex, $\hat{f}(x)$ is convex.

If we'd like to choose a model which is “simple”, we are giving convex penalties for “complexity”. Usually a function of this type is nonconvex, such as the number of nonzero elements of $x$ written $||x||_0$. If we take $D = \{x | ||x||_\infty \le 1\}$ then the convex envelope is the L1 norm $||x||_1$.

The rank function $\mathrm{rank}(A)$ over the domain $D = \{A \;|\; ||A|| \le 1\}$ then the convex envelope is $\hat{f} \sum \sigma_i(A)$.

If we look at all of the hyperplanes which lie below $f : \Re^n \rightarrow \Re$, then geometrically it is intuitive that we will be given $\hat{f}$ by looking at their pointwise supremum. The convex conjugate of a function $f$ is a new function $f^\star : \Re^n \rightarrow \Re \cup \{\infty\}$ defined as $f^\star(y) = \mathrm{sup}_x \langle y, x \rangle - f(x)$. This is always a convex function of $y$. To see how the convex conjugate helps, look at $-f^\star (y) = -\mathrm{sup}_x \langle y, x \rangle - f(x) = \mathrm{inf}_x f(x) + \langle y, -x \rangle$. If we imagine that $f$ is differentiable then we want to characterize $x^\star (y)$ (the minimizer of $-f^\star(y)$ with $y$ fixed) then we can set $\nabla f(x^\star(y)) - y = 0 \rightarrow \nabla f(x^\star (y)) = y$. Then $-f^\star (y) = f(x^\star(y)) + \langle \nabla f(x^\star(y), 0 - x^\star(y) \rangle$. So the convex conjugate measures the $y$-intercept of all of the lines which are tangent to $f$. In other words, the convex conjugate just gives a way of recording all of the hyperplanes which lie tangent to the function. To get back, we can take the biconjugate $(f^\star)^\star$, which is $f^{\star\star}(x) = \mathrm{sup}_y \langle x, y \rangle - f^\star(y) = \mathrm{sup}_y \langle x, y \rangle - \sup_z \langle y, z \rangle - f(x) = \sup_y \langle x, y \rangle + \mathrm{inf}_z -\langle y, z \rangle + f(z) = \mathrm{sup}_y \mathrm{inf}_z f(z) = \langle y, x - z \rangle$. To show that $f^{\star\star}$ is the convex envelope, we need that $f^{\star\star} \le f$ and $g \le f^{\star\star} \forall \mathrm{\; convex\;} g \le f$. We have that $f^{\star\star} \le f$ because $\mathrm{sup}_y \mathrm{inf}_z f(z) - \langle y, x - z \rangle \le \sup_y f(x) = f(x)$. Now, if $g \le f$ then $g^\star(y) = \mathrm{sup}_x \langle y, x \rangle - g(x) \ge \mathrm{sup}_x \langle y, x \rangle - f(x) = f^\star{y}$. Applying this trick again, we have $g^{\star\star} \le f^{\star\star}$, so we need to show that if $g$ is differentiable and convex, then $g = g^{\star\star}$. Note that $g^{\star\star}(x) = \mathrm{sup}_y \mathrm{inf}_z g(z) + \langle y, x - z \rangle$. If $g$ is differentiable and convex, then if we set $y = \nabla g(x)$ then $z^\star(y) = x$ is a minimizer. So $g^{\star\star} \ge g(x) + \langle \nabla g(x), x - z^\star(y) \rangle = g(x)$, and we also had that $g^{\star\star}(x) \le g(x)$ so $g^{\star\star} = g$ so $f^{\star\star} \le f$ under mild conditions on $f$.

Consider $f(x) = ||x||_0 + \mathcal{I}_{||x||_\infty \le 1}$, where $\mathcal{I}_{||x||_\infty \le 1}$ is an indicator function which is $0$ when $||x||_\infty \le 1$ and $\infty$ otherwise. So $f^\star(y) = \mathrm{sup}_x \sum_i x_i y_i - ||x||_0 - \mathcal{I}_{||x||_\infty \le 1} = \mathrm{sup}_x \sum_i x_i y_i - \vec{1}_{x_i \ne 0} - \mathcal{I}_{|x_i|_\infty \le 1} = \mathrm{sup}_{|x_i| \le 1} \sum_i x_i y_i - \vec{1}_{x_i \ne 0} = \sum_i (|y_i - 1|)_+$. Then $f^{\star\star} = \mathrm{sup}_y \sum_i x_iy_i - (|y_i| - 1)_+$. If $|x_i| \le 1$, we set $y_i^\star = \mathrm{sgn}(x)$; otherwise we set $f^{\star\star} = \infty$. So $f^{\star\star} = \sum_i |x_i|$ if $||x||_\infty \le 1$ and $\infty$ otherwise.

The convex conjugate trick finds all hyperplanes below a function and then transforms back to a function which is the supremum of these hyperplanes. This is similar to the Lagrangian dual problem. For the quadratic constrained quadratic program $\min_x x^\top Ax \mathrm{\;s.t.\;} x^\top B_i x + c_i \le 0, i \in \{1, \ldots, n\}$, if we take the dual we have $L(x, \lambda) = x^\top Ax + \sum_i \lambda_i (x^\top B_i x + c_i) = x^\top (A + \sum_{i = 1}^n \lambda_i B_i)x + \lambda^\top c$ and $d(\lambda) = \mathrm{inf}_x L(x, \lambda)$, which will be $-\infty$ if $A + \sum_i \lambda_i b_i \prec 0$ and $\lambda^\top c$ if $A = \sum_i \lambda_i B_i \succeq 0$. So the dual problem gives $\max_{\lambda \ge 0} \lambda^\top c \mathrm{\;s.t.\;} A + \sum_i \lambda_i B_i \succeq 0$ or equivalently $\min_{\lambda \ge 0} -\lambda^\top c \mathrm{\;s.t.\;} -(A + \sum_i \lambda_i B_i) \preceq 0$. If we now write down $\tilde{L}(\lambda, Z) = -\lambda^\top c + \langle Z, -(A + \sum_i \lambda_i B_i) \rangle$ we have the dual dual $\tilde{d}(z) = \min_{\lambda \ge 0} - \lambda^\top c + \langle Z, -(A + \sum_i \lambda_i B_i) \rangle = \min_{\lambda \ge 0} \sum_i -\lambda_i (c_i + \langle Z, B \rangle ) - \langle A, Z \rangle$. The minimizer is then $-\langle A, Z \rangle$ when $C_i + \langle Z, B \rangle \le 0$ and $-\infty$ otherwise. So our problem is then $\max_{Z \succeq 0} \tilde{d}(Z)$ or $\min_{Z \succeq 0} \langle A, Z \rangle \mathrm{\;s.t.\;} \langle B, Z \rangle + C_i \le 0, \forall i$. So in general, to get these tricks we can take the Lagrangian bi-dual and we get these tricks to solve these hard problems.

elen_4835_course_notes.txt · Last modified: 2015/12/17 21:59 (external edit)