elen_e6886_course_notes

There is a huge amount of data being generated on the internet, and in scientific experiments, etc. For some data, we may need to sample it and compress it to represent it, and we may want to predict new data, label it, and cluster/classify data. Most of this data is very high dimensional - images contain over a million pixels, videos contain over a billion voxels. We may also be dealing with very large datasets. Classically, we have <math>n \rightarrow \infty</math> samples in <math>p</math> (fixed) dimensions, and there are a great number of tools used to help decide whether predictions we make are reliable. Experiments could also be designed to produce reliable and complete data. Today, we have a situation where sometimes we have about as many samples as dimensions (and sometimes more) and in many cases the data is incomplete or erroneous. When data is so high dimensional, we require many more observations to learn functions.

We can often represent the observations by transforming the data to a different basis which may be sparse - most of the coefficients are close to zero. A sparse approximation of the signal would be one where the input <math>y</math> is approximated as the sum of some elements taken from some dictionary (matrix) <math>\psi</math>, where most of the sum coefficients are zero <math>y\approx\psi x</math>, most of <math>x</math> is zero. This depends on the choice of the dictionary - if it's good, most coefficients will hopefully be close to zero.

We will often represent a vector as the linear combination of entries in a matrix as <math>y = Ax</math> and <math>y_i = \langle A_i, x \rangle</math> where <math>y \in \Re^m</math>, <math>x \in \Re^n</math>, <math>n > m</math> and <math>a \in \Re^{m \times n}</math>, <math>\operatorname{dim}(\operatorname{Null}(A)) \ge n - m</math>. We have fewer observations than unknowns in the system of equations - there are many continuous signals compatible with the observations. We obtain the matrix <math>A</math> from some transformation or by learning from a large amount of data. We can also add in an undetermined term by <math>y = Ax + e = [AI][x,e]</math>.

If there's a particular signal <math>x</math> you're trying to recover, it could be difficult because there are many solutions to the equation. To overcome this, we need to make a structural assumption about <math>x</math>. We may assume that <math>\|x\|_0 = {i \mid x_0(i) \ne 0}</math> is small - the vector is sparse. We can also require that the vector has “structured sparsity” where certain parts of it need to be more sparse than others. Often we will not assume that <math>x</math> is sparse, but that <math>x = \psi z</math> and <math>z</math> is sparse so that we have <math>y = A\psi z</math> where we now have a new unknown <math>z</math>. In order to solve the system of equations, <math>x_0</math> being sparse indicates that we can choose the solution which gives a maximally sparse <math>x_0</math> - minimizes <math>\|x\|_0</math> (the <math>l^0</math> norm) - constraint <math>P0</math>.

A function is a norm if it's nonnegative homogeneous <math> \forall \alpha, \|\alpha x\| = |\alpha\||x|</math>, positive definite <math>\forall x, \|x\| \ge 0</math>, and satisfies the triangle inequality <math>\|x-y\| \le \|x\| + \|y\|</math>. Some common norms are the Euclidian norm <math>\|x\|_2 = \sqrt{\sum_i x_i^2}</math>, the Manhattan distance <math>\|x\|_1 = \sum{|x_i|}</math> and <math>\|x\|_{\infty} = \max(x)</math>. The unit ball (the set of points of distance 1 from a fixed central point) of each of these norms are the circle, triangle, and square… all other <math>\|x\|_p</math> are somewhere in between depending on <math>p</math>. The <math>l_0</math> as defined above is not nonnegative homogeneous because multiplying by a scalar does not change the number of nonzero entries, but it's still called a norm. It can be defined by <math>\lim_{p\rightarrow 0} \|x\|_p^p = \lim_{p\rightarrow 0} \sum_i{|x_i|^p} = \|x\|_0</math>.

If <math>y = Ax_0</math>, and <math>y = Ax^\prime</math> and <math>\|x^\prime\|_0 \le \|x_0\|_0 = k</math> then we have <math>A(x_0 - x^\prime) = 0</math> so <math>x_0 - x^\prime = x</math> is in the null space of <math>A</math> and is sparse itself, <math>\|x\|_0 \le 2k</math>. In order to show that this second solution <math>x^\prime</math> can't exist, it is sufficient to have that every member of <math>\operatorname{Null}(A)</math> is more than <math>2k</math>-sparse so that <math>\|x\|_0 \not\le 2k</math>. We can define <math>\operatorname{spark}(A) = \min\{\|v\|_0}} \mid v \ne 0, v \in \operatorname{Null}(A)\}</math>, the number of nonzero entries in the sparsest nullvector of <math>A</math>. So, whenever <math>y = Ax_0</math> and <math>\|x_0\|_0 < \frac{\operatorname{spark}(A)}{2}</math> then <math>x_0</math> is the unique sparsest solution of <math>y = Ax_0</math>. If the spark is large, then every null vector is dense and we can recover any sparse solution of <math>y = Ax</math>.

- Say <math>A</math> with dimensions <math>m,n</math> is chosen at random as <math>A_{i, j} \sim_{iid} \mathcal{N}(0,1)</math> the <math>\operatorname{spark}(A) = m + 1</math> (very large). So for generic matrices the spark is pretty big, which makes recovery easier.
- The matrix consisting of an identity matrix appended to the discrete Fourier basis will have <math>\operatorname{spark}</math> of <math>2\sqrt{m}</math> when <math>m</math> is square.
- An identity matrix with a single column vector of zeros has a <math>\operatorname{spark}</math> of <math>0</math>.

A naive solution to P0 could be found with an exhaustive search, which takes exponential time. P0 is an NP-hard computation problem - it can't be solved efficiency. To prove, it can be made equivalent to the exact 3-cover problem, where you have a set <math>S = \{1, \ldots, m\}</math> and subsets <math>C_1, \ldots, C_n</math> and <math>|C| = 3</math> where the subsets contain each element of <math>S</math> exactly once. If we construct <math>A_{i,j} = 1, i \in C_j</math> and <math>0</math> elsewhere and <math>y = 1_m</math> (the <math>m</math>-dimensional vector of ones), then <math>y = Ax</math> has a solution x with <math>\|x\|_0 < \frac{m}{3}</math> if and only if there exists an exact 3-cover. In one direction, if there exists an exact 3-set cover <math>C^\prime</math>, then <math>|C^\prime| = m/3</math> because there is one set for every three of the <math>m</math> elements in <math>S</math>. Then if we create <math>x</math> with <math>x_j = 1, U_j \in C^\prime</math> and 0 otherwise, multiplying <math>Ax</math> will solve the equation <math>Ax = y</math> because <math>x</math> will “select” solely the ones in <math>A</math>, generating the vector of 1s <math>y</math>. In the other direction, if we assume there is a solution <math>x_0</math> with <math>||x_0||_0 = m/3</math> and set <math>C^\prime = \{U_j \mid x_0(j) \ne 0\}</math> (again “selecting” subsets according to the nonzero entries of <math>x_0</math>). Each column of <math>A</math> has three nonzero entries (corresponding to the three integers in <math>U_j</math>) and <math>A_I</math> has at most <math>m/3</math> columns (corresponding to the nonzero entries of <math>x_0</math> so <math>A</math> has at most <math>m</math> nonzero entries. <math>A_Ix_0(I) = y</math> because we are just omitting the entries of <math>x_0</math> which are <math>0</math> (which would not count towards the multiplication anyways). So each row of <math>A_I</math> must have at least one nonzero entry. There are <math>m</math> rows in <math>A</math> so each row must have exactly one nonzero entry. This indicates that <math>C^\prime</math> covers <math>S</math> because each integer will only appear once in each of <math>U_j</math>.

A set <math>C</math> is convex if <math>\forall x, y \in C, \alpha \in [0, 1], \alpha x + (1- \alpha)y \in C</math>. For a function defined on the convex set <math>D \in \Re^n</math> such that <math>f: D \rightarrow \Re</math>, <math>f</math> is convex if it's “bowl-shaped”: if <math>\forall x, y \in D, \alpha \in [0, 1], f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1 - \alpha)f(y)</math>. Note that we are essentially plugging in the linear interpolation of <math>x</math> and <math>y</math> and testing that it's below the linear interpolation of <math>f(x)</math> and <math>f(y)</math>. We can alternatively say that the “epigraph” (the set of all points above the function <math>\operatorname{epi}(f) = \{(x, t) \mid x \in D, f(x) \le t\}</math>) is a convex set. Graphically, if the interpolation between two points on a function lie above the function, it's convex; if the line between two points in a set lies within the set, it's convex. When optimizing, if the function is not globally convex, there's no way of knowing that you have found a maximum or minimum - if a function has local maxima or minima, it is not convex. We can say that an optimization problem <math>\min f(x) \mid x \in C</math> is convex if <math>f</math> is a convex function and <math>C</math> is a convex set, and the problem can be solved globally.

We want to find a convex function which has many of the same properties as the <math>l^0</math> norm except that it's convex so that it can be solved globally - a “convex surrogate”. Given some index set <math>B</math>, and a collection of convex functions <math>f_\beta(x) \mid \beta \in B</math>, we can define the piecewise supremum <math>g(x) = \sup_\beta f_\beta (x)</math> and <math>g</math> is convex. Define the convex envelope of a function <math>f(x)</math> over a set <math>D</math> as <math>g(x) = \sup\{h(x) \mid h \operatorname{convex}, h(y) \le f(y) \forall y \in D\}</math>. You look at all the convex functions which lie below <math>f</math> and take their maximum - the “best convex underestimator” of <math>f</math> - the largest convex function which is never larger than <math>f</math>. For the <math>l^0</math> norm, we are looking for any convex function <math>h</math> which lies below all points on the <math>l^0</math> norm. In one dimension, <math>|x|</math> is the convex envelope of <math>\|x\|_0</math> over <math>[-1, 1]</math>. More generally, the convex envelope of <math>f(x) = ||x||_0</math> over <math>B_\infty = \{x \mid \|x\|_\infty = 1\}</math> is <math>g(x) = \|x\|_1</math>, which we can prove by induction. This motivates looking at the <math>\|x\|_1</math> as a good estimator for the <math>l^0</math> norm. This is essentially saying that given the affine solution space, we should look for the member in the space whose norm is the smallest. Using the <math>l^1</math> norm can be thought of as expanding the unit ball until it hits the affine solution space. The <math>l^1</math> norm is the convex norm which is the “closest” to <math>l^0</math>. If we used <math>l^2</math>, it's possible that a non-sparse solution could be chosen. In higher dimensions, it's possible that the <math>l^1</math> norm ball also hits a point which is not sparse - we always want the point in the solution space on a coordinate axis. As you raise the dimension, the <math>l^1</math> unit ball becomes relatively “thinner” so this becomes less of a problem.

Under certain circumstances, we can recover the sparsest solution <math>x_0</math> to <math>y = Ax</math> by minimizing <math>\|x\|_1</math>. Intuitively, it may be easier to recover the sparsest solution if the columns of <math>A \in \Re^{m \times n}</math> are “far apart” in the high dimensional space <math>\Re^m</math>. We can quantify the “closeness” of the columns as the coherence, <math>\mu(A) = \max_{i \ne j} \frac{|\langle A_i, A_j\rangle|}{\|A_i\|_2\|A_j\|_2}</math>.

A symmetric (<math>M \in \Re^{k \times k} = M^\top</math>) matrix is positive semidefinite (<math>M \succeq 0</math>) if <math>\forall x \in \Re^k</math>, <math>x^*Mx \ge 0</math>. When a matrix is positive semidefinite, we can find <math>k</math> eigenvalues <math>\lambda_1(M) \ge \cdots \ge \lambda_k(M) \ge 0</math> and a corresponding orthonormal basis <math>U = \[ u_1 \mid \cdots \mid u_k \]</math> of eigenvectors such that <math>U^*U = I</math> and <math>Mu_j = \lambda_ju_j</math> and <math>M = U\Lambda U^*</math> where <math>\Lambda</math> is the <math>k \times k</math> matrix with eigenvalues down its diagonal and zeros elsewhere. The <math>l^2</math> operator norm is <math>\|M\| = \sup_{\|x\| = 1}\|Mx\|</math>. When <math>M \succeq 0</math>, <math>\|M\| = \lambda_1(M)</math>, the largest eigenvalue. If <math>M \succeq 0</math> and <math>\lamda_k(M) > 0</math>, then <math>M</math> is invertible and the eigenvalues of <math>M^{-1}</math> are the reciprocals of the eigenvalues of <math>M</math>: <math>\|M^{-1}\| = 1/\lambda_k(M)</math>. When <math>M</math> is diagonal, the eigenvalues are just the ordered diagonal elements because when multiplying it acts as a vector.

When <math>M</math> has mostly small terms off of the diagonal, the eigenvalues are close to the terms on the diagonal: <math>|M_{ii} - \lambda| \le \sum_{j \ne i} |M_{ij}|</math> for some <math>i</math>. Since <math>\lambda</math> is an eigenvector, we have <math>Mx = \lambda x</math>. If <math>i</math> is the index of element <math>x</math> with largest magnitude, we have <math>\sum_j M_{ij}x_j = \lambda x_i </math>. Subtracting <math>M_{ii}</math> from both sides gives <math>(M_{ii} - \lambda)x_i = \sum_{j \ne i} M_{ij}x_j</math>. Taking the absolute value, we have <math>|M_{ii} - \lambda| = |\sum_{j \ne i}M_{ij}x_j|/|x_i| \le \sum_{j \ne i}|M_{ij}||x_j|/|x_i| \le \sum_{j \ne i} |M_{ij}|</math> because <math>x_j \le x_i</math> so <math>|x_j|/|x_i| \le 1</math>.

This theorem can help provide a bound on the <math>\operatorname{Spark}</math> of a matrix (which can help decide if a given solution to <math>y = Ax</math> is the unique sparsest solution). Specifically, we have <math>\operatorname{Spark(A)} \ge 1 + 1/\mu(A)</math>. Assuming that the columns of <math>A</math> are normalized such that <math>\|A_i\|_2 = 1</math>, we have <math>\mu(A) = \max_{i \ne j}|\langle A_i, A_j \rangle|</math> because the denominator (from the definition of <math>\mu(A)</math> will be 1. If we have some <math>k < 1 + 1/\mu(A)</math>, then for any subset of size <math>I</math>, the columns of <math>A_I</math> are linearly independent if and only if <math>M = A_I^*A_I</math>. **Is k not some integer? Subset of what?** <math>M</math> is positive semidefinite **why?** and its entries are inner products of the columns of matrix <math>A</math>, such that the diagonal entires are <math>1</math> and the off-diagonal elements are bounded by <math>\mu(A)</math> because <math>\mu(A)</math> is defined as the largest inner product between columns of <math>A</math>. By the Gershgorin Disc Theorem we have <math>|1 - \lambda_k(M)| \le \sum_{j\ne i}|M_{ij}|</math> (replacing <math>M_{ii}</math> with 1). The summation on the right has <math>k-1</math> terms, each less than <math>\mu(A)</math> so we have at most <math>|1-\lambda_k(M)| \le (k-1)\mu(A)</math> and so <math>-(k-1)\mu(A) \le 1 - \lambda_k(M) \le (k-1)\mu(A)</math>, rearranging the right inequality gives <math>-\lambda_k(M) \le -1 + (k-1)\mu(A) \rightarrow \lambda_k(M) \ge 1 - (k-1)\mu(A)</math>. In order for <math>\lambda_k</math> to be positive, we need <math>k < 1 + 1/\mu(A)</math>, which would imply that the columns of <math>A_I</math> are linearly independent for any <math>I</math> of size <math>k</math> which sets a bound on the Spark. This gives us the constraint that whenever <math>x_0</math> has at most <math>\frac{1}{2}(1 + 1/\mu(A))</math> nonzero entries, it's the unique solution (because <math>\operatorname{Spark}(A)/2 \ge \frac{1}{2}(1 + 1/\mu(A))</math> so if <math>\|x_0\|_0 < \frac{1}{2}(1 + 1/\mu(A))</math> it's also going to be smaller than <math>\operatorname{Spark}(A)/2</math>).

Given a convex function <math>f : \Re^n \rightarrow \Re</math>, <math>x_0</math> is the optimal solution to <math>\min f(x) \mid y = Ax</math> if and only if there exists a <math>\nu</math> such that <math>\nabla L(x_0, \nu) = 0</math> where <math>L(x, \nu) = f(x) + \langle \nu, y - Ax \rangle</math> is the Lagrangian. Requiring <math>\nabla L(x_0, \nu) = 0</math> is equivalent to requiring that <math>A^*\nu = \nabla f(x_0)</math> - we need to find a <math>\nu</math> which satisfies this equation to guarantee that <math>x_0</math> is optimal. <math>\|x\|_1</math> is not differentiable, so we need to generalize the notion of the gradient <math>\nabla</math>. Any differentiable function can be linearly approximated with the gradient near any value <math>f(x) \approx f(x_0) + \langle f(x_0), x - x_0\rangle</math>; when <math>f</math> is convex, this approximation serves as a lower bound <math>f(x) \ge f(x_0) + \langle f(x_0), x-x_0 \rangle \forall x, x_0</math>. We can approximate this condition with a “subgradient” <math>\gamma \in \Re^n</math> of <math>f</math> at <math>x_0</math> if <math>f(x) \ge f(x_0) + \langle \gamma, x - x_0</math>, forming the subdifferential <math>\delta f(x_0) = \{\gamma \mid f(x) \ge f(x_0) + \langle \gamma, x - x_0 \}</math>. When a function is differentiable at <math>x_0</math>, <math>\delta f(x_0) = \{\nabla f(x_0)\}</math>; otherwise, many points may be included in the subdifferential. In <math>\Re^n</math> we have <math>\delta |x|</math> is <math>\gamma = 1</math> when <math>x_i > 0</math>, <math>\gamma_i = -1</math> when <math>x_i < 0</math>, and is <math>\gamma_i = [-1, 1]</math> when <math>x_i = 0</math> (in general, if a function <math>f : \Re^n \rightarrow \Re</math> is the sum of <math>n</math> functions <math>f_i : \Re \rightarrow \Re</math> then <math>\delta f(x) = \{\gamma \mid \gamma_i \in \delta f_i(x_i) \}</math>). In P1, <math>x_0</math> will be optimal if <math>\exists \nu \mid A^*\nu \in \delta f(x_0)</math> because given another solution <math>x^\prime</math>, <math>Ax^\prime = Ax_0 = y \rightarrow A(x^\prime - x_0) = 0</math>. Plugging <math>x^\prime</math> into the definition of the subgradient, we have <math>f(x^\prime) \ge f(x_0) + \langle \gamma, x - x_0 \rangle \rightarrow \|x^\prime\|_1 \ge \|x_0\|_1 + \langle \gamma, x - x_0 \rangle</math>. Since we require that <math>\exists \nu \mid A^*\nu \in \delta f(x_0)</math>, we have <math>\|x^\prime\|_1 \ge \|x_0\|_1 + \langle A^*\nu, x^\prime - x_0 \rangle = \|x_0\|_1 + \langle \nu, A(x^\prime - x_0)\rangle = \|x_0\|_1</math>, so <math>x^\prime</math> is not “more optimal” than <math>x_0</math>. In addition, it can be shown that if <math>x</math> is a solution to P1 with support (non-zero values) <math>I</math>, and <math>A_I</math> has full column rank <math>I</math>, and <math>\exists \nu \mid A^*\nu = \operatorname{sign}(x_i), i \in I; |A_i^*\nu| < 1, i \in I^C</math>, then <math>x</math> is the unique optimal solution. This implies that if <math>y = Ax_0</math> with <math>\|x_0\|_0 < \frac{1}{2}(1 + 1/\mu(A)</math>, <math>x_0</math> is the unique sparsest solution to P1.

In general, we can do better than the result above to guarantee that a sufficiently sparse <math>x_0</math> is the unique solution to P1, based on a further requirement of <math>A</math>: The Restricted Isometry Property, which is satisfied with order <math>r</math>, constant <math>\delta > 0</math> if <math>\forall z \mid \|z\|_0 \le r, (1 - \delta)\|z\|_2^2 \le \|Az\|_2^2 \le (1 + \del)\|z\|_2^2</math>. In other words, for sparse (<math>\|z\|_0 \le r</math> vectors, multiplying by <math>A</math> does not change the <math>l^2</math> norm a great deal (within <math>1 \pm \delta</math>). An linear map <math>L : V \rightarrow W</math> is an isometry if <math>\forall v \in V, \|Lv\|_W = \|v\|_V</math> (<math>\|\cdot\|_X</math> is the norm for vector space <math>X</math>) - the “restricts” isometry to sparse vectors and is an approximation (the <math>\delta</math> factor) of true isometry. It must be approximate because <math>A</math> has vectors in its null space for which <math>\|z\|_2^2 \ne 0</math>. For some <math>r</math>, we can define <math>\delta_r(A)</math> to be the smallest <math>\delta</math> that satisfies RIP for <math>A</math>. Coherence implies that a matrix will have RIP of order <math>2</math> with constant <math>r = \mu(A)</math>, or that every column submatrix of size 2 is well-conditioned because it is comparing two matrix entries at a time. The RIP of order <math>k</math> ensures that every submatrix of <math>k</math> columns is well-conditioned.

If <math>\delta_r(A) < 1</math>, every <math>r</math> column submatrix of <math>A</math> has full rank so <math>\operatorname{Spark}(A) > r</math> because at least <math>r</math> columns of <math>A</math> are not linearly dependent. The RIP demands that the submatrices are full rank AND well-conditioned (a small change in <math>y</math> will be reflected as a small change in <math>x</math>). If <math>\delta_{2k}(A) < 1</math> and <math>y = Ax_0</math> with <math>\|x_0\|_0 \le k</math>, <math>x_0</math> is the unique solution to P0 - an alternate solution would violate the RIP. For P1, we have that if <math>\delta_{2k}(A) < \delta_*</math>, whenever <math>y = Ax_0</math> with <math>\|x_0\|_0 \le k</math>, <math>x_0</math> is the unique optimal solution to P1. <math>\delta_*</math> is a constant whose value is unknown, estimated to be at least 0.46. If <math>\delta_{2k}</math> for some <math>k</math> is less than this quantity (implying a certain niceness for <math>A</math>), we can guarantee a unique optimal solution. Unfortunately, the RIP is not easy to assess for a given matrix <math>A</math>; it is really only useful when we can construct a random <math>A</math> ourselves and make it have some properties that guarantee an RIP with very high probability.

<math>x_0</math> may not be perfectly sparse; instead, some terms may just be close to zero. In addition, we will always have some noise such that <math>y = Ax_0 + z</math> where <math>z</math> is the noise term and <math>\|z\|_2 \le \epsilon</math>. This modifies P1 to minimizing <math>\|x\|_1</math> subject to <math>\|Ax - y\|_2^2 \le \epsilon^2</math>, which relaxes the equality constraint according to the amount of noise assumed to be present. This is equivalent to minimizing <math>\lambda\|x\|_1 + \frac{1}{2}\|Ax - y\|_2^2</math>, which requires some (unknown) calibrated relation <math>\epsilon \leftrightarrow \lambda</math>. In the presence of noise and inexact sparsity, the constraints get modified a little bit. If <math>y = Ax_0 + z</math> with <math>\|z\|_2 \le \epsilon</math> and <math>\delta_{2k}(A) \le \delta_*</math>, and we denote <math>\[x\]_k</math> as the best approximation of <math>x</math> where only <math>k</math> entries are nonzero, and <math>\hat{x}</math> is the solution to the modified P1 described above, then <math>\|\hat{x} - x_0\|_2 \le \frac{C}{\sqrt{k}}\|x_0 - \[x_0\]_k\|_1 + C^\prime \epsilon</math> **(are the C and C' terms constants?)**. In other words, the deviation caused by estimating <math>x_0</math> is quantified according to how sparse <math>x_0</math> is and how big the noise level <math>\epsilon</math> is. When the noise term <math>z</math> is random, we can achieve better bounds.

Verifying that a matrix satisfies the RIP is very difficult to determine, but if the entries of the matrix are iid drawn from certain distributions, it can be argued that the matrix satisfies the RIP with high probability.

Say <math>A</math> is a matrix whose entries are iid and drawn from <math>\mathcal{N}(0,1/m)</math>. Then each column will have an <math>l^2</math> norm close to 1. <math>A</math> will have an RIP of order <math>k</math> with constant <math>\delta</math> if for every subset <math>I</math> of size <math>k</math>, <math>\sqrt{1 - \delta} \le \sigma_{\min}(A_I) \le \sigma_{\max}(A_I) \le \sqrt{1 + \delta}</math>, so we need to study the singular values of Gaussian random matrices. If <math>Q \in \Re^{p \times q}</math> is iid <math>\mathcal{N}(0, 1)</math> with <math>p > q</math> then <math>\sqrt{p} - \sqrt{q} \le \mathbb{E}\[\sigma_p(Q)\] \le \mathbb{E}\[\sigma_1(Q)\] \le \sqrt{p} + \sqrt{q}</math>. So if <math>p</math> is a lot larger than <math>q</math>, <math>\mathbb{E}\[\sigma(Q)\]</math> will have strict bounds and thus will be very well conditioned (columns nearly orthogonal). The <math>m \times k</math> matrix <math>A_I</math> will have this property, so <math>1 - \sqrt{k/m} \le \mathbb{E}\[\sigma_q(A_I)\] \le \mathbb{E}\[\sigma_1(A_I)\] \le 1 + \sqrt{k/m}</math> (**is the scaling here done correctly?**). If <math>k</math> is much smaller than <math>m</math> then this would help the RIP, but we need to consider all column submatrices and we need to see how the expectation of the singular values relates to their actual values. To determine how close the singular values are to their expectations, we need to determine their concentration of measure. We will see that for any <math>\delta > 0</math>, there exists <math>C(\delta), c(\delta) > 0</math> such that if <math>m > C(\delta)k\log(ne/k)</math> then <math>A</math> has RIP of order k with constant <math>\delta</math> with probability at least <math>1 - 2e^{-c(\delta)m}</math> - a very high probability!

A function is 1-Lipschitz if <math>\forall x, y \in \Re^D, |f(x) - f(y)| \le \|x-y\|</math>. If <math>z</math> is a <math>D</math> dimensional random vector with iid <math>\mathcal{N}(0, 1)</math> random variables, and <math>f:\Re^D \rightarrow \Re</math> is 1-Lipschitz, then <math>\mathbb{P}\[f(z) > \mathbb{E}\[f(z)\] + t\] \ge e^{\frac{-t^2}{2}}</math> and <math>\mathbb{P}\[f(z) < \mathbb{E}\[f(z)\] - t\] \ge e^{\frac{-t^2}{2}}</math>. In other words, the Lipschitz-1 function will not deviate much from its expectation. Hopefully we can apply some similar bounds to the singular values of <math>A_I</math>.

Considering the <math>j</math>-th singular value of a matrix <math>Q \in \Re^{m \times k}</math> as a function from <math>\Re^{m \times k} \righatarrow \Re</math>. For any <math>j</math>, <math>|\sigma_j(P) - \sigma_j(Q)| \le \|P - Q\|_F</math> - the singular values are 1-Lipschitz of <math>Q</math>. So if <math>Q</math>'s entries are iid <math>\mathcal{N}(0, 1)</math> then <math>\mathbb{P}\[ \sigma_{\min}(Q) < \sqrt{m} - \sqrt{k} - s \lor \sigma_{\max}(Q) > \sqrt{m} + \sqrt{k} + s \] \le 2e^{\frac{-s^2}{2}}</math>. Then the matrix <math>A_I</math> iid <math>\mathcal{N}(0, 1/m)</math> can be generated by <math>A_I = (1/\sqrt{m})Q</math>; setting <math>s = t\sqrt{m}</math> and dividing by <math>\sqrt{m}</math> gives <math>\mathbb{P}\[\sigma_{\min}(A_I) < 1 - \sqrt{k/m} - t \lor \sigma_{\max}(A_I) > 1 + \sqrt{k/m} + t\] \le 2e^{\frac{-mt^2}{2}</math>. As long as <math>m</math> is large, the failure probability is small.

If we sum the failure probabilities over all column submatrices <math>A_I</math> the bound grows to <math>\binom{n}{k}2e^{\frac{-mt^2}{2}} \le 2e^{\frac{-mt^2}{2} + k\log(\frac{ne}{k}) }</math>.

The conditions described above are satisfied provided <math>k < cm/\log(n)</math>, which will guarantee that P1 recoveres solutions with about <math>m/\log(n)</math> entries - better than <math>O(\sqrt{m})</math>.

Consider a set of <math>p</math> points <math>y_1, \ldots, y_p \in \Re^D</math>. We'd like to construct a mapping <math>f</math> to project these points to a lower dimension space <math>\Re^d</math> such that <math>d \ll D</math> and <math>\forall i, j, (1-\epsilon)\|y_i - y_j\|_2 \le \|f(y_i)-f(y_j)\|_2 \le (1 + \epsilon)\|y_i-y_j\|_2</math> - in other words, the distance between points is roughly conserved. The Johnson-Lindenstrauss lemma proves that for <math>d = \Omega(\log(p)/\epsilon^2)</math>, such a mapping exists, and that it is close to be nearly optimal to make it a random linear map. **what is omega?**

The rank of a matrix is the dimension of its column or (equivalently) its row space. The columns of a matrix <math>Y</math> lie in a <math>\operatorname{rank}(Y)</math>-dimensional linear subspace of the data space, which can be characterized by the matrix's singular value decomposition <math>U\Sigma V^* = \sum_{i = 1}^r\sigma_i u_i v_i^*</math> such that <math>U^*U = I, V^*V = I</math> and <math>\Sigma</math> is the diagonal matrix of singular values <math>\sigma_1 \ge \sigma_2 \ge \cdots \le \sigma_r \ge 0</math>.

The optimization problem <math>\min_X \|X - Y\|_F</math> subject to <math>\operatorname{rank}(X) \le r</math> is solved by <math>X_* = \sum_{i = 1}^r \sigma_i u_i v_i^*</math> (truncating the singular value decomposition sum of <math>Y</math> to <math>r</math> terms). The same solution is applicable to solving the low-rank approximation when the error is measured in the operator norm. The first <math>r</math> singular values can be found in time <math>O(mnr)</math> - polynomial time. This optimization can be used for Principal Component Analysis, which finds a best-fitting low-dimensional subspace. It can also be reformulated as the rank minimization problem <math>\min \operator{rank}(X)</math> subject to <math>\|X-Y\|_F \le \epsilon</math> - finding the matrix of minimum rank that is consistent with the observed matrix.

We seek to minimize the rank of <math>X</math> by <math>\min \operatorname{rank}(X)</math> subject to <math>\mathcal{A}[X] = y</math> where <math>y \in \Re^p</math> is an observation and <math>\mathcal{A} : \Re^{m \times n} \rightarrow \Re^p</math> is a linear map. When <math>p < mn</math>, <math>\mathcal{A}[X] = y</math> is underdetermined. <math>\mathcal{A}</math> is arbitrarily defined as <math>\mathcal{A}\[X\] = (\langle A_1, X\rangle, \ldots, \langle A_p, X \rangle)</math>.

Note that <math>\operatorname{rank}(X) = \|\sigma(X)\|_0</math>, the spectral norm <math>\|X\|_{l^2 \rightarrow l^2} = \sup_{\|w\|_2 \le 1} \|Xw\|_2 = \sigma_1(X) = \|\sigma(X)\|_\infinity</math>, and the Frobenius norm <math>\|X\|_F = \|\sigma(X)\|_2</math>. Just like P0, rank minimization is intractable in the worst case, so we'd like a complex surrogate for <math>\operatorname{rank(X)}</math>. <math>\|\sigma(X)\|_1</math> depends on the entries of <math>X</math> in a very complex way, but as we just saw applying norms to the singular values gives valid norms on the matrix itself.

A norm <math>\|\cdot\|_\diamond</math> is unitary invariant if for all matricies <math>R_1, R_2 \mid R_i^*R_i = R_iR_i^* = I</math> and all <math>X</math> we have <math>\|R_1XR_2\|_\diamond = \|X\|_\diamond</math>. Any function of <math>\sigma(X)</math> is unitary invariant because <math>\sigma(X) = \sigma(R_1 X R_2)</math> and any unitary invariant matrix norm can be written as a function of the singular values:

<math>\|X\|_\diamond</math> is a unitary invariant matrix norm iff <math>\|X\|_\diamond = f(\sigma(X))</math> for some “symmetric gauge function” <math>f : \Re^n \rightarrow \Re</math> where <math>f</math> is a norm on vectors, <math>f</math> is permutation invariant, and <math>f</math> is invariant under sign change.

Define the nuclear norm as <math>\|X\|_* = \|\sigma(X)\|_1 = \sum_i\sigma_i(X)</math>. The nuclear norm is a convex function, and may serve as a convex surrogate for <math>\operatorname{rank}(X)</math> because the nuclear norm is its convex envelope over the spectral norm ball <math>B = \{X \mid \|X\|\le 1\}</math> of matrices with operator norm at most one.

Given a vector space <math>V</math> with norm <math>\|\cdot\|_\diamond : V \rightarrow \Re</math> and inner product <math>\langle \cdot, \cdot \rangle : V \times V \rightarrow \Re</math>, define the dual norm <math>\|\cdot\|_\diamond</math> to be <math>\|x\|_* = \sup_{\|y\|_\diamond\| \le 1} \langle x, y \rangle</math>. <math>\|x\|_1</math> and <math>\|x\|_\infinity</math> are dual norms. In the space <math>\Re^{m \times n}</math> of matrices with inner product <math>\langle X, Y\rangle = \operatorname{tr}\[X^*Y\]</math>, the nuclear norm <math>\|X\|_*</math> is the dual norm of the spectral norm <math>\|X\|</math>.

The nuclear norm is convex and a good convex surrogate for the rank, and can be solved with a semidefinite program in polynomial time. To guarantee an optimal solution, we can say that <math>\mathcal{A}</math> has a rank-restricted isometry property of order <math>r</math> with constant <math>\delta</math> if <math>\forall X \mid \operatorname{rank}(X) \le r, (1 - \delta)\|X\|_F^2 \le \|\mathcal{A}\[X\]\|_2^2 \le (1 + \delta)\|X\|_F^2</math> and denote <math>\delta_r(\mathcal{A})</math> as the smallest <math>\delta</math> such that the rank-RIP holds. When <math>\delta_{2r} < 1</math> and <math>y = \mathcal{A}\[X_0\]</math> with <math>\operatorname{rank}(X_0) \le r</math>, <math>X_0</math> is the only solution to <math>y = \mathcal{A}\[X\]</math> with rank <math>\le r</math>. If <math>y = \mathcal{A}\[X_0\]</math> with <math>\operatorname{rank}(X_0) \le r</math> and <math>\delta_{5r} \le 1/10</math> then <math>X_0</math> is the unique optimal solution to the nuclear norm minimization problem. Unfortunately, some common problems do not satisfy the rank-RIP.

View each document in a collection of <math>n</math> documents as a collection of words from a dictionary of size <math>m</math>, and compute a histogram of word occurrences as an <math>m</math>-dimensional vector <math>y_j</math> where <math>y_{j, i}</math> is the fraction of occurrences of the word <math>i</math> in document <math>j</math> and <math>Y = \[y_1 | \cdots | y_n\]</math>. Assume there's a set of topics <math>t_1, \ldots, t_r</math> where each topic is a probability distribution on <math>\[m\]</math>. Each article may involve multiple topics, modeled as <math>p_j = \sum_{l = 1}^r t_l\alpha_{l, j}</math> where <math>\alpha_{1, j} + \alpha_{2, j} + \cdots + \alpha_{r, j} = 1</math>. Assume <math>y_j</math> is generated by sampling words independently at random from <math>p_j</math> and computing a histogram. If the number of words is large, then <math>y_j \approx p_j</math>, so <math>Y \approx TA</math> where <math>T = \[t_1, \ldots, t_r\]</math> and <math>A = \[\alpha_1, \ldots, \alpha_n\]</math>. LSA computes the best low-rank approximation to <math>Y</math> to use for search and indexing.

<math>m</math> users consume some of <math>n</math> products and rate them - we want to use the ratings to predict which products a user would rate well, producing a matrix <math>X \in \Re^{m \times n}</math> which represents how well a user would rate each product. Setting <math>\Omega = \{(i, j) \mid </math> user <math>i</math> has rated product <math>j\}</math>, then we observe <math>Y = P_\Omega\[X\]</math> where <math>P_\Omega\[X\](i, j) = X_{ij}</math> when <math>(i, j) \in \Omega</math> and 0 otherwise - we want to fill in <math>Y</math> to get <math>X</math>. Assuming <math>X</math> is low-rank gives <math>\min \operatorname{rank}(X)</math> subject to <math>P_\Omega\[X\] = Y</math>.

Given the problem of recovering low-dimensional structure in high-dimensional space, certain formulations can provide a solution given some constraints - most frequently that the problem can be solved with some convex optimization. A great deal of work has been expended on developing efficient algorithms for convex optimization. The relevant algorithms should be able to scale to high dimensions, with cheap iterations, many of which will be required for convergence.

We intend to solve <math>\operatorname{minimize} \lambda \|x\|_1 + \frac{1}{2}\|Ax-y\|_2^2</math> instead of <math>\operatorname{minimize} \|x\|_1</math> subject to <math>Ax=y</math>. This belongs to a general class of problems which attempts to minimize the sum <math>F(x)</math> of a smooth function <math>f(x)</math> and a nonsmooth function <math>g(x)</math> - <math>\|\cdot\|_1</math> is a nonsmooth regularizer and <math>\frac{1}{2}\|Ax-y\|_2^2</math> is a smooth fidelity term.

The simplest possible case, when <math>f(x) = 0</math>, makes <math>F</math> differentiable, and gives rise to the gradient descent method: Start with an initial estimate <math>x_0</math>, and update it via the gradient <math>x_{k+1} = x_k - t_k\nabla F(x_k)</math> until the estimate has converged. <math>t_k</math> are step sizes. If <math>\nabla F</math> is Lipschitz <math>\|\nabla F(x) - \nabla F(z)\|_2 \le L\|x - y\|_2</math> for a constant <math>L</math> we can set <math>t_k = 1/L</math>. (Note that for <math>F(x) = \frac{1}{2}\|Ax - y\|_2^2</math>, <math>\nabla F(x) = A^T(Ax - y)</math> and <math>L = \|A^TA\|</math>.) When <math>t_k = 1/L</math>, <math>F(x_k) - F(x_\star) \le \frac{L\|x_0 - x_\star\|_2^2}{2k}</math> and the number of iterations required to ensure <math>F(x_k) - F(x_\star) \le \epsilon</math> is <math>k \ge \frac{L\|x_0 - x_\star\|_2^2}{2\epsilon}</math> for some optimizer <math>x_\star</math> of <math>F</math>.

When <math>F</math> is nonsmooth, we can use the subgradient by initializing <math>x_0</math> and repeatedly choosing some <math>g_k \in \delta F(x_k)</math> and setting <math>x_{k + 1} = x_k - t_kg_k</math>. In the case of L1 minimization, we have <math>\delta F(x) = \lambda\delta \|\cdot\|_1(x) + A^T(Ax-y)</math> so <math>\lambda \operatorname{sign}(x) + A^T(Ax-y) \in \delta F(x)</math>, giving the iteration <math>x_{k+1} = x_k - (\lambda \operatorname{sign}(x_k) + A^t(Ax_k - y))</math>. However, the convergence rate is not good - if <math> \forall k, \|g_k\|_2 \le G</math> and we choose <math>t_k = \frac{\|x_0 - x_\star\|_2}{\sqrt{k}\|g_k\|2}</math> then <math>\min_{i=0, \ldots, k} F(x_i) - F(x_\star) \le \frac{G\|x_0 - x_\star\|_2}{\sqrt{k}}</math> and the algorithm converges with <math>1/\sqrt{k}</math> - which may be true in general for non-differentiable functions.

For any point z, <math>F(x) \le F(z) + \langle x - z, \nabla F(z)\rangle + \frac{L}{2}\|x - z\|_2^2 = Q(x, z)</math>, the “quadratic upper bound” for <math>F(x)</math> - a strongly convex function. If <math>\hat{x}</math> is the unique minimizer, <math>0 \in \delta_xQ(\hat{x}, z)</math> and <math>\nabla F(z) + L(\hat{x} - z) = 0 \rightarrow \hat{x} = z - \frac{1}{L}\nabla F(z)</math>. So, we use the quadratic upper bound in the gradient step as <math>x_{k+1} = \operatorname{arg}\min_x Q(x, x_k)</math> when <math>F</math> is differentiable. For non-differential <math>f</math>, we can set <math>Q(x, z) = f(x) + g(x) + \langle x - z, \nabla g(x)\rangle + \frac{L}{2}\|x-z\|_2^2</math>. This only works when <math>Q(x, x_k)</math> can be minimized efficiently.

For <math>f(x) = \|x\|_1</math>, dropping terms in <math>Q(x, z)</math> constant with respect to <math>x</math> we have <math>\operatorname{arg}\min_x Q(x, z) = \operatorname{arg}\min_x f(x) + \langle x, \nabla g(z)\rangle + \frac{L}{2}\|x - z\|_2^2</math> <math>= \operatorname{arg}\min_x f(x) + \frac{L}{2}\|x - (z - \frac{1}{L}\nabla g(z))\|_2^2</math>. Setting <math>h = z - \frac{1}{L} \nabla g(z)</math> we need to solve <math>\min_x \frac{1}{L}f(x) + \frac{1}{2}\|x - h\|_2^2</math>. Subdifferentiating with <math>f(x) = \|x\|_1</math> we have a solution <math>x_\star</math> iff <math>0 \in \frac{1}{L}\delta\|\cdot\|_1(x_\star) + x_\star - h \rightarrow x_\star \in h - \frac{1}{L}\delta \|\cdot \|_1(x_\star)</math>. Substituting in the subdifferential for the L1 norm, we have <math>x_\star(i) = h(i) - \frac{1}{L}</math> when <math>h(i) > \frac{1}{L}</math>, <math>0</math> when <math>h(i) \in \[-\frac{1}{L}, \frac{1}{L}\]</math>, and <math>h(i) + \frac{1}{L}</math> when <math>h(i) < -\frac{1}{L}</math>. Then, setting <math>L = \|A^TA\|</math> we update <math>x_{k+1}</math> in the proximal gradient descent algorithm with <math>S_{\mu/L}(x_k - \frac{1}{L}A^T(Ax_k - y))</math> where <math>x_\star = S_{1/L}(h)</math>. Each step reduces the smooth term, then applies the soft thresholding operator to decrease the L1 norm.

The soft thresholding is the “proximity operator” of the L1 norm. The proximal gradient algorithm applied to <math>F(x) = f(x) + g(x)</math> is not affected by the presence of nonsmooth <math>f(x). If <math>g</math> is convex, differentiable, and <math>L</math>-Lipschitz, and <math>f</math> is convex, then <math>F(x_k) - F(x_\star) \le \frac{L\|x_k - x_\star\|_2^2}{2k}</math>.

To try to obtain a lower bound on the amount of computation needed to solve an optimization problem (can we do better than gradient?), we can study <math>F</math> with <math>f(x) = 0</math>. A first-order method can only access <math>F</math> through its values and gradients, and has some rule to determine when to stop, at which point it outputs an approximate solution. The number of steps required is the method's complexity, and the difference between the output at the stopping point and the true output is the accuracy. We seek a lower bound, which would assert that no first order method can guarantee to optimize all <math>F</math> with some error <math>\epsilon</math> in <math>T</math> steps. It has been shown that for the set of convex, differentiable functions whose gradient is L-Lipschitz, with a minimizer <math>x_\star \mid \|x_\star\|_2 \le R</math>, the complexity is lower bounded by <math>O(1)\min(n, \sqrt{\frac{LR^2}{\epsilon}})</math>. This is better than the gradient descent method, so there may be room for improvement

The lower bound on first order methods (for smooth <math>F</math>)can be achieved by setting <math>x_0 = 0</math>, <math>y_0 = 0</math>, <math>t_0 = 1</math> and setting <math>x_k + 1 = y_k - \frac{1}{L}\nabla F(y_k)</math>, <math>t_{k + 1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}</math>, and <math>y_{k + 1} = x_{k + 1} + \frac{t_{k - 1}}{t_{k + 1}}(x_{k + 1} - x_k)</math>. This algorithm does the gradient step on a chosen set of auxiliary points calculated according to the fastest possible convergence - they are calculated with an additional “push” in the direction of the point found. The steps in this algorithm satisfy <math>F(x_k) - F(x_\star) \le \frac{2LR^2}{(k+1)^2}</math>. For nonsmooth <math>F</math>, we replace the gradient step with a minimization of <math>Q(X, z) = f(x) + g(z) + \langle \nabla g(z), x - z\rangle + \frac{L}{2}\|x - z\|_2^2</math>. This satisfies the same lower bound as the original algorithm.

elen_e6886_course_notes.txt · Last modified: 2015/12/17 21:59 (external edit)