# http://colinraffel.com/wiki/

### Site Tools

coms_6772_course_notes

# COMS W4772/6772 Course Notes

## Linear/Gaussian Modeling, PCA

### Representation (Vectorization)

Forgetting about where the data come from and what it is, we can just represent it as a long vector. Almost anything can be written as a long vector, and then we can use Machine learning algorithms on them. This assumes that the dataset can be represented as a vector in some n-dimensional space. Images can be read lexicographically, with each vector element the RGB value of each pixel. Vectorization loses structure and important features - small changes in the “semantic meaning” can cause huge changes in the vector representation, depending on the representation.

### Least squares detection

#### Sum of Square Difference

In the domain of images, in order to avoid problems caused by translation, rotation, etc, you can do least-squares detection, which takes a template and computes the sum of the squared difference for all of the rotations, scales, and translations you're interested in. The sum of square difference is a correlation The vectors can be normalized to avoid issues with intensity (lighting); this simplifies th correlation.

#### Gaussian

Minimum least squares is equivalent to maximum likelihood under a Gaussian - it's basically a Gaussian detector. We are then looking for the best $x$ (position) according to a Gaussian model. We can generalize with a covariance matrix which gives and additional source of flexibility - distances change at different rates across different dimensions. Given a collection of vectors, we are seeking the highest probability for a query vector given the data.

### PCA

We can't afford to store an arbitrary covariance matrix, so it would be nice to only worry about the “key directions” of the data. Instead of writing out all of the directions of the data, we only use a smaller number of directions (the Principal Components) which minimize the least-squares error. The eigenvectors of the covariance matrix with the highest eigenvalues are the ones which will give you the best reconstruction - they are the ones which explain the most variance. The coefficients can be computed by subtracting the mean from a query vector and taking the inner product with each eigenvector.

#### PCA problems

If the vector dimension is huge, and there are many vectors in the data, the covariance matrix is huge and finding its eigenvalue decomposition takes a lot of time. The snapshot method first subtracts the mean of all of the datapoints from each datapoint. Then, the formula for the covariance matrix simplifies because the mean of this new dataset is zero. Then, instead of computing the covariance matrix matrix, we compute the Graham matrix: the inner product of each zero-mean vector with each other vector. Then, we compute the Graham matrix's eigenvalue decomposition. The eigenvectors of the covariance matrix are then proportional to the sum over all datapoints multiplied with each Graham matrix eigenvector: $v_i \prop \sum_{j=1}^{N}x_j\~{v}_i(j)$. The Graham matrix eigenvectors tell you how to add up the dataset vectors to reconstruct the Covariance matrix eigenvectors. This approach relies on the number of datapoints being smaller than the data dimension.

After performing PCA, you no longer have an exact representation of the covariance matrix (to be used in Gaussian likelihood). So, you need to have some kind of assumption - you can assume that all of the eigenvalues are equal - which indicates that the gaussian in each non-PC direction is circular. Then, you can store an approximation to the huge covariance matrix by only storing the principal component eigenvalues and vectors and the error (eigenvalue for all other eigenvectors) and arbitrary vectors orthogonal to the principal eigenvectors.

#### PCA Bases

If we get the principal components for large datasets of some kind of signal, we get some kind of basis which is meant to reconstruct the signal well. For a large corpus of natural images, the PCs look like 2D Gabor or Wavelet functions, which is somewhat similar to what appears in the retina. The eigenvalues do not decay nearly as quickly, so it is harder to determine where to stop using PCs. Also, if the image changes a small amount, the PCs used in reconstruction tends to change a large amount. Most of the spectrum energy is wasted due to nonlinear effects.

## Nonlinear Manifold Learning

### Embedding

Data is often not in a linear subspace/Gaussian distributed. Embedding instead assumes that the data falls on the manifold, and seeks to recover (without reconstruction) the true coordinates of the data and represent them compactly. Rotation of an image (multiplication of $\mu$ by a permutation matrix $T$) is a nonlinear operation - $x_c \approx T^C\mu$. PCA would attempt to reconstruct via $x_c = \mu + c\vec{v}$ for some vector $\vec{v}$. Embedding attempts to characterize the modification by coordinates on the manifold which are a compact representation of what is happening.

#### Multidimensional Scaling

We want to find a low-dimensional embedding where the distance between points mapped to the space mimics the distances between points in the original space. A dissimilarity $d$ is a generalization of distance where $d(\vec{x}_i, \vec{x}_j) \ge 0$, $d(\vec{x}_i, \vec{x}_i) = 0$ and $d(\vec{x}_i, \vec{x}_j) = d(\vec{x}_j, \vec{x}_i)$. A distance is a dissimilarity which satisfies the triangle inequality $d(\vec{x}_i, \vec{x}_k) \le d(\vec{x}_i, \vec{x}_i) + d(\vec{x}_j, \vec{x}_k)$. The Euclidian distance is a distance metric; one over the Euclidian distance is not but it is a dissimilarity. We can assume that that for $N$ objects, we can compute a dissimilarity matrix $\Delta$ such that $\Delta_{ij} = d(\vec{x_i}, \vec{x_j})$. Given $x_1, \ldots x_N$ with dissimilarity matrix $\Delta$ under metric $d$, we seek $y_1, \ldots y_N$ with a dissimilarity matrix $D$ under dissimilarity metric $d^\prime$ such that some difference between $\Delta$ and $D$ is minimized. There is no explicit mapping between $x_1, \ldots x_N$ and $y_1, \ldots y_N$ - the $y$ points are just chosen to minimize this difference between dissimilarities. Many differences (measure of similarity between $D$ and $\Delta$) are possible.

#### Locally Linear Embedding

Instead of trying to preserve all of the distances between all of the points, only preserve some distances corresponding to points which are nearby (don't try to preserve the distance from Tokyo to Paris, only Tokyo to Shibuya). To choose the distances to preserve, find the k nearest neighbors of each point and only preserve those. Locally linear embedding figures out how to add up each point's neighbors to reconstruct it using some weights (which add up to 1) in a weight matrix for each point $W$, and then creates a new set of points (centered around the origin with identity covariance to reduce the number of solutions) which can be constructed using the same weights. Depending on the locations of the neighbors, it may be hard to reconstruct such weights.

The weights $W$ can be solved without quadratic programming by breaking up the weight matrix row-wise. Then, the error minimization can be simplified to a linear system using a Lagrange multiplier. Once the weight matrices are known, the weights for the new set of points can be estimated in a way similar to PCA. LLE can take a nonlinear manifold and unfold it into a lower dimensional representation.

#### Kernel PCA

PCA attempts to model the data with the mean of the data and eigenvectors of the covariance matrix with some coefficients. A kernel between two vectors $x_i, x_j$ is equivalent to doing an inner product with a mapped version $k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$. It's a way of computing the inner product after you've mapped it to a higher-dimensional space. So, the idea is that if you have data which is not linearly Gaussian distributed, you can first map it to a space where it is and then map it back to an even lower dimensional space. So all of the dot-products in PCA get replaced with kernels. Assuming that the kernels are zero-mean, we can use the snapshot method again to get the covariance matrix of the high-dimensional data. Assuming that he eigenvectors we care about are linear combinations of the kernel-transformed data, we don't need to compute the entire covariance matrix (which is huge). We can use the Graham matrix instead. Then we can calculate the weights of the eigenvectors using eigenvector decomposition of the Graham matrix. From the unit norm constraint we can derive a constraint for the weights. We can't do reconstruction because the kernel dimension is huge. If the data is not zero-mean, you can create an alternative kernel by substituting a kernel function which includes zero mean subtraction. If you use RBF kernels, you are essentially looking very locally at each datapoint, which is very similar to LLE.

#### Semidefinite Embedding

Semidefinite embedding “stretches” the data such that the distances in each neighborhood are preserved, then applies PCA or kernel PCA. Semidefinite embedding finds an inner product (gram) matrix which maximizes non-neighbor distances and minimizes neighbor distances. Semidefinite programming is used to maximize the trace of the gram matrix of the embedding such that local isometry is satisfied (nearby points stay close together). SDE attempts to unfold, but if certain points are close together which are not meant to be neighbors it will make the algorithm incorrectly unfold the space. Also, the number of neighbors can affect the results a large amount. When you are “pulling” in lots of directions, you can unnecessarily increase the dimensionality depending on the structure of the data. It can force things apart which should be flattened together. Instead, we want to “pull” in the directions you're visualizing, and “squashing” in the directions you're not looking at - maximize the visualized eigenvalues, and minimizing the rest.

#### Minimum Volume Embedding

To ensure the dimension is reduced, we can maximize the “fidelity”: the sum of the eigenvalues in the top dimensions divided by the sum of all of the eigenvalues - equivalently, maximize the top dimension eigenvalues, and minimize the rest, the “eigengap”. A good dimensionality reduction will have large fidelity, quantified as $F(k) = \frac{\lambda_1 + \lambda_2}{\sum_i \lambda_i}$. Essentially we want most of the eigenvalue energy to be concentrated on the first few eigenvalues, which is equivalent to maximizing $\lambda_1 + \lambda_2 - \beta\sum_i\lambda_I$ for some $\beta$. We want to stretch $d$ top directions and squash the rest - maximizing the “eigengap” between the $d$ and $d + 1$ eigenvalues. The MVE algorithm attempts to make this grow until it stops growing. SDP packages solve a specific class of problems, so the problem (which is a function of the eigenvalues) needs to be formulated appropriately. If the weights of eigenvalues ($\alpha_i$) are decreasing, the objective is convex - since we are minimizing, this will result in two solutions at the boundaries. To compute the MVE, you need data points, a kernel $\mathcal{K}$, and the number of dimensions $d$ and the number of neighbors $k$. First, an affinity matrix $A_{ij} = \mathcal{K}(\vec{x_i}, \vec{x_j})$ is formed which is used to compute the connectivity (nearest neighbor) matrix $C$. Setting $K = A$, the eigenvectors $\vec{v_1}, \ldots \vec{v_N}$ and eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \lambda_N$ are found. Then $B = -\sum_{i = 1}^d \vec{v_i}\vec{v_i}^T + \sum_{i = d + 1}^N \vec{v_i}\vec{v_i}^T$ is computed and SDP is used to find $\hat{K} = \mathrm{argmin}_{K \in \mathcal{K}} (KB)$ If $||K - \hat{K}||$ is not suitably small, repeat with $K = \hat{K}$, otherwise, perform kernel PCA on $\hat{K}$ to get $d$-dimensional $y_1, \ldots y_N$. If there are too few neighbors, the points will get very stretched apart and clustered. If there are too many, you will basically just be doing kernel PCA. You can also use a different way (not just k-nearest) way of computing the neighbors, like explicitly or using spanning trees which connects points with small distance and avoids loops. When using spanning trees a constraint has to be placed on the non-connected points in order to prevent all distances from shrinking.

### Maximum Entropy Discrimination

Maximum entropy is based on the principle of indifference or insufficient reason. Given a distribution $p(x)$, the entropy is $H(r ) = -\sum_x p(x)\log( p(x) )$.

#### Exponential Family

Maximum entropy is maximum likelihood for exponential family distributions. Exponential family describes integrals/ML/MAP which are “nice” mathematically, defined in natural form $p(x | \theta) = h(x)e^{\theta^TT(x) A(\theta)}$ - the model (with parameters $\theta$) and data ($x$) interact only through a dot product (linear) and $A(\theta)$ is convex. The Gaussian can be expressed in natural form, for example, with $h(x) = \frac{1}{2\pi}$, $T(x) = \left[ x; x^2 \right]$, $\theta = \left[ \frac{\mu}{\sigma^2}; -\frac{1}{2\sigma^2} \right]$ and $A(\theta) = \frac{\mu^2}{2\sigma^2} + \log(\sigma) = \frac{\theta_1^2}{4\theta_2} - \frac{1}{2}\log(-2\theta_2)$. $A(\theta)$ is the Laplace transform of $h(x)$ with $A(\theta) = \log \int_x h(x)e^{\theta^T T(x)}dx$ when $T(x)$ is the identity. The first derivative of $A(\theta)$ is $E_{p(x | \theta)} \{T(x) \}$, and the second derivative is $\mathrm{Cov}_{p(x|\theta)} \{T(x)\}$. The covariance is positive semidefinite, so the function is convex, so you can find its minimum. The maximum likelihood estimate simplifies to $A^\prime(\theta) = \frac{1}{N}\sum_i T(x_i)$ - the average of the data has to be the expected average of the model. A “sufficient statistic” summarizes the data via simple averages of finite functions of it $T(x)$ - given tons of points, we only need to know a few statistics to generate the data.

##### Conjugate Priors

Exponential family distributions have a dual or conjugate distribution over its parameters defined by $p(\theta | X) = j(n, X)e^{R(\theta)^T T(x) - nA(\theta)$ which lies in the exponential family; for a Gaussian with mean, it's a Gaussian, for a Gaussian with covariance, it's an inverse Wishart, Bernoulli's conjugate inverse is a Beta distribution, etc.

##### E-families

An exponential family has probability distribution of the form $e^{H(x) + \theta^T(x) - A(\theta)}$. We can then characterize distributions by their $H(x)$ and $A(\theta)$ and the domain of $\theta$.

#### Maximum Entropy Theory

The maximum entropy approach in learning finds a distribution $p(x)$ which satisfies some constraints but also maximizes the entropy of $p(x)$ (makes it as smooth as possible). Essentially, given a small amount of data, we expect the simplest/smoothest possible explanation (indifference/insufficient reason). The entropy is $Hâ„— = -\sum_x p(x)\log(p(x))$, which measures how “spread” the probability over the values $x$ is. For example, the distribution of the weather in NYC is low entropy (sometimes sunny, sometimes cloudy, sometimes snowy…) than it is in LA (where it is often sunny). Higher entropy means more bits needed to communicate. The uniform distribution has the highest possible entropy, $p = \mathrm{argmax} H(p )$. However, if we add constraints there is some other distribution with maximum entropy, like an expectation constraint $E_{p(x)} \{f_i(x)\} = \sum_x p(x)f_i(x) = \alpha_i$ - forcing the expected value of $f(x)$ to be a constant. $f(x) = x$ constrains the mean, $f(x) = x^p$ constrains the higher $p$ moments, etc.

##### Minimum Relative Entropy

Given functions $f_1(x), \ldots f_N(x)$ with associated constants $\alpha_1, \ldots \alpha_N$ (with $f_0(x) = 1; \alpha_0 = 1$ always) these constraints select out a “piece” $\mathcal{B}$ of all possible distributions, over which we will maximize entropy $p = \mathrm{argmax}_{p \in \mathcal{B}} H( p)$. This is equivalent to finding the distribution which is the “closest” to the uniform distribution, or the one which minimizes the relative entropy to the uniform distribution. We can also minimize the relative entropy to some other distribution we like $q(x)$, such as zero-mean unit-variance Gaussians. The relative entropy is formulated by the Kullback-Leibler Divergence, $KL(p||q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)$. When $q(x)$ is uniform, this simplifies to $KL(p||q) = \sum_x p(x) \log(p(x)) - \sum_x p(x)C \rightarrow -H( p)$. We can represent the constraints on $p$ with Lagrange multipliers to obtain $\mathcal{L} = KL(p||q) - \sum_i \lambda_i (\sum_x p(x) f_i(x) - \alpha_i) - \gamma( \sum_x p(x) - 1)$. KL is convex and and the constraints are linear, so we can solve by taking the derivative and setting to zero, giving $p(x) = \frac{1}{Z} q(x)e^{\sum_i \lambda_i f_i(x)}$, where $Z$ is given by the normalization. Once the $\lambda_i$s are given, we can solve this with $Z(\lambda) = \sum_x q(x)e^{\sum_i \lambda_i f_i(x)}$, which puts the maximum entropy solution in exponential family form where the Lagrange multipliers behave like $\theta$ parameters, giving $\mathcal{L} = -\log(Z(\lambda)) + \sum_i \lambda_i \alpha_i$. We can get $\lambda$ by maximizing $\mathcal{D} = -\log(Z(\lambda)) + \sum_I \lambda_i \alpha_i$ which is a concave dual Lagrangian. So when doing maximum entropy, you design some constraints on the problem (in the form of functions) rather than the form, and you are given the distribution which maximizes the entropy given those constraints. This allows you to get a distribution for some random variable given some constraints (potentially many) using convex optimization.

##### Maximum Entropy Inequalities

When the constraint functions $f_i(x)$ correspond to inequalities like $\sum_xp(x)f_i(x) \ge \alpha_i$, the Lagrange multipliers have to be positive (for greater than) or negative (for negative) only. When the constraints are inequalities, they carve out a piece of $\mathcal{B}$ which is a convex hull. Minimizing the relative entropy finds the distribution closest to the convex hull via a projection. Some exponential family distribution will exist for constraints from real-world data.

##### SVMs and regularization

Given training examples $\{X_1, \ldots X_T\}$ and binary labels $\{y_1, \ldots y_T\}$, and a discriminant function (for example linear) $L(X; \Theta) = \theta^Tx + b$ where the parameters are $\Theta = \{\theta, b\}$, we seek to maximize the distance between a line which separates the data by label such that the distance of the data to the line is maximized. Support vector machines only find a single best parameter set $\Theta^\ast$ although other solutions may be almost as good. So instead of giving the best one we can give a distribution $P(\Theta)$ with a high probability weight to the good solutions. The probability distribution will give us the same solution if the probability distribution is $P(\Theta) = \delta( \Theta - \Theta^\ast )$. So to find the best distribution, we seek to maximize $\int P(\Theta) [y_tL(X_t; \Theta) - 1]d\Theta] \ge 0, \forall t$ - on average, over this distribution, you get the answer right. The regularization penalty function is then replaced by a minimization of $KL(P||P_0)$ which favors small $\Theta$s. This is exactly a maximum entropy problem - finding the distribution over SVMs which is close to our prior and satisfies the constraints. Here, the feature functions are over $\Theta$. We want a distribution which classifies everything correctly.

The solution to the maximum entropy formulation will be of the form $P(\Theta) = \frac{P_0(\Theta)}{Z(\lambda)} e^{\sum_t \lambda_t (y_t L(X_t; \Theta) - 1))$. We are trying to find distributions over models instead of distributions over data. We seek to maximize the dual negated log of the partition function $Z(\lambda)$ which is the integerl of $P(\Theta)$. We then have a constraint per data point. We need to specify a prior $P_0(\Theta)$ using a zero mean identity-covariance Gaussian (written as $N$) on a linear vector and another zero-mean Gaussian on the bias: $P_0(\Theta) = N(\theta | 0, I)N(b, 0, \sigma^2)$ where $L(X; \Theta) = \theta^T X + b$. The latter Gaussian is preferring values of $b$ close to zero. The partition function then simplifies to a function over $\lambda$s. If you compute the negative logarithm, you get a quadratic expression of the $\lambda$s. In order to have a non-informative prior (any $b$ is OK) we need $\sigma^2$ to go to infinity, which would indicate a very fat Gaussian prior for $b$. This makes the negative log partition function go to negative infinity unless $\sum_t \lambda_t y_t = 0$, so we add that as another constraint. We can use any distribution we want on the prior.

##### Nonlinear Classifiers

We can use maximum entropy for nonlinear discriminant functions $L(X; \Theta)$ without using the kernel trick. For example, a generative-style classifier discriminant function from the ratio of two probability models $L(X; \Theta) = \log( \frac{P(X | \theta_+)}{P(X | \theta_-)} + b$ where the two probability models are whatever. If we tried to solve a model like this with SVM, we'd get a non-convex program and it would break. With Maximum Entropy Discrimination, the linearity of $P(\Theta)$ is still there so the problem is still convex. Assuming that the probability models in the ratio are in the exponential family $p(X | \theta) = e^{A(X) + X^T\theta - \K(\theta)}$, they have conjugate $P(\theta | \chi ) = e^{\~{A}(\theta) + \theta^T \chi - K(\chi)}$, which we will use as the prior. We need the partition function to be analytically computable. This gives a classifier which assigns probabilities but also accurately labels all of the data. This technique for classification is always better than maximum likelihood Gaussians because each Gaussian has no idea what the other one is doing and so they don't build good decision boundaries.

### Log-linear Models

#### Generative vs. Conditional vs. Discriminative

Generative learning takes a bunch of labeled data $\mathcal{D} = (x_t, y_t)_{t=1}^T$ sampled from some unknown distribution $P(x, y)$ and family of functions with some parameters $p_\theta(x, y)$ which minimizes some cost function or maximizes some objective (e.g. likelihood $\prod_{t = 1}^T p_\theta(x_i, y_i))$, which given an input $x$ produces $\hat{y} = \mathrm{arg}\max_y \frac{p_\theta (x, y)}{\sum_y p_\theta(x, y)}$ (learning from the data to recreate the data). Discriminative learning takes a family of functions $f_\theta(x)$ parameterized by $\theta$, and finds $\theta$ by minimizing error $\sum_{t = 1}^T l(y_i, f_\theta( x_i ))$ for some loss function $l$ in order to produce a label $\hat{y} = f_\theta(x)$ given $x$ (just get the label right - probability of each doesn't matter). Logistic regression and conditional random fields perform conditional learning; they try to find $p_\theta(y|x)$ characterized by $\theta$, a good distribution on the outputs (probability of each output given an input). $\theta$ is found by maximizing the conditional likelihood $\prod_{t = 1}^T p_\theta(y_i | x_i)$, which given $x$ produces $\hat{y} = \mathrm{arg}\max_y p_\theta(y | x)$.

#### Generative models

Minimum relative entropy minimizes the distance (using KL divergence) of the probability distribution to some desired distribution subject to some constraints which specify which distributions are allowed. The solution is $p(y) = h(y)e^{\theta^Tf(y) + \mu^Tg(y)}/Z(\theta, \mu)$ which also looks like an exponential family model. The dual is the negative log-partition which is the maximization of the solution's negative log. An exponential family model can be viewed as a log-linear model over discrete $y \in \Omega$ where $|\Omega| = n$, which (without dependency on $x$) is $p(y | \theta) = \frac{1}{Z(\theta)} h(y) e^{\theta^T f(y)}$ with parameters $\theta$, features which map $y$ to some vector in $\Re^d$, with prior (your favorite distribution) $h$ and a partition function which ensures that $f$ normalizes.

##### Example

Given iid data $y, \ldots, y_T; y_i \in \{0, 1\}$ we can write the distribution as an exponential family distribution $p(y | \theta) = \theta^y(1 - \theta)^y$. We want to find $\theta$ (coin side proportion) by maximizing the likelihood which gives $\theta \frac{1}{T} \sum_t y_t$.

#### Converting to conditional

A binary logistic regression for the example above computes the probability $p(y = 1, x | \mu) = \frac{1}{1 + e^{-\mu^T\phi(x)}}$ with $p(y = 0, x | \theta) = 1 - p(y = 1 | x, \theta)$. Now to find the best possible $\theta$ we maximize the conditional likelihood $L(\mu) = \sum_{t = 1}^T p(y_t | x_t, \mu)$, whose maximization is a convex optimization problem. Logistic regression is a log-linear model.

#### Log-linear models

Log linear-models look like an exponential family, except $Z, h, f$ (partition function which ensures $p$ normalizes, prior which indicates the probability of each class, and feature functions) can all depend on $x$: $p(y|x, \theta) = \frac{1}{Z(x, \theta)} h(x, y) e^{\theta^Tf(x, y)}$. This is different than an exponential family with some concatenation of $y$ and $x$ because $Z$ can't depend on $y$. The parameters are still $\theta$, the functions take $f : \Omega_x \times \Omega_y \rightarrow \Re^d$ map $x, y$ to a vector, the prior $h$ is a fixed non-negative measure. A prediction is just $\hat{y} = \mathrm{arg}\max_y p(y | x, \theta)$. For a multi-class problem, $y = \{1 , \ldots, n\}$. If $\phi(x) \in \Re^k$ then the feature function $f \in \Re^{kn}$ can be defined by concatenating a bunch of copies of $\phi(x)$ as $f(x, y) = \[\delta[y = 1]\phi(x)^T \delta[y = 2]\phi(x)^T \ldots \delta[y = n]\phi(x)^T]^T$. This simplifies to binary logistic regression when $n = 2$, $h(x, y) = 1$. We are trying to draw $n$ lines for $n$ classes such that as we get close to a line we become equally likely for the classes that the line divides. The parameters of the lines are concatenated into the $f(x, y)$.

#### Conditional random fields

CRFs are basically log-linear models when $n$ (the size of the output space) is particularly big, which makes $f(x, y)$ look like a lot of zeros, then a $\phi(x)$, then more zeros. For example, if the input is sentences in English and the output are sentences in French, there are tons of possible sentences in French. Even though it's a huge high dimensional problem, it's still convex. Since $f(x, y)$ is mostly zeros we don't need to store it. We can only do this for huge spaces when the $p(y)$ can be decomposed as a product of a lot of probabilities. For a CRF $p(y | x_j, \theta) = \frac{1}{Z_{x_j}(\theta)} h_{x_j} (y) e^{\theta^T f_{x_j}(y))}$ with maximum conditional log-likelihood function $J(\theta) = \sum_{j = 1}^t \frac{h_{x_j}(y_j)}{Z_{x_j}(\theta)} + \theta^T f_{x_j}(y_j)$. Regularized adds a penalty $-\frac{t\lambda}{2} ||\theta||^2$.

##### Majorization

To train a CRF we maximize this log-likelihood function, traditionally using “majorization” which is like EM. Majorization is a way of minimizing some cost function $\theta^\star = \mathrm{arg}\min_\theta C(\theta)$ when it has no closed form solution. Instead we use a surrogate $Q$ with a closed form update, which is a bounded $C$ which is a smooth approximation always above $C$.

These methods are not as fast as gradient descent approaches; new majorization methods are even faster. The part that is hard to minimize is the log part of the log-linear gradient… we like quadratic functions. We can upper bound $Z(\theta)$ by a quadratic. The quadratics cause the estimate move more quickly and thus the solution to converge more quickly. With this approach, as long as your feature vectors are not huge (they are bounded by some radius r) within a relatively small number of steps.

We want to maximize the log likelihood of $p(y | x, \theta)$ which is the same as minimizing the sum of the log partition functions $Z$ plus a linear term in $\theta$. This is a concave optimization problem. The derivative with respect to $\theta$ gives a difference between feature vectors at the the true label $f(x_t, y_t)$ minus the expected feature vector under an expectation of the labels $f(x_t, y)p(y | x_t, \theta)$ (when you sum across the data you get the right feature vector). At each step we update $\theta$ by moving it in a small step along its derivative.

Given a current $\theta$, instead of adding a small step in the direction of the derivative, we can update with a simplified version of the gradient which only looks at a single fixed data point picked randomly at each iteration. This can make the algorithm faster.

### Graphical Models and Structured SVMs

#### Linear Chains

In a parts-of-speech classification task (1 million words in sentences labeled with one of 45 parts of speech), each sentence of say seven words has $45^7$ possible labels. If we assume that the words have independencies, we can say the probability splits up into a chain $p(y_1, \ldots y_n | x, \theta) = p(y_1 | x, \theta) \prod_{i = 2}^n p(y_1 | y_{i-1}, x, \theta)$. This just looks at the pairwise probability instead of the probability of the entire chain. This makes training faster because you don't have to do gradient ascent on all of the possible outcomes. A conditional random field will maximize the conditional likelihood rather than the likelihood (as in EM). A graphical model will split up the probability, and can be written as the product of the transition tables for the cliques. Instead of enumerating all $n$ label possibilities, it can be cast as a graphical model by building a junction tree.

#### Graphical Model

A graphical model is a graph representing a distribution $p(Y)$ where you look at all the cliques in the graph and make smaller tables for each clique. Once you build a junction tree, you pass messages between the cliques in the junction tree. With the Collect algorithm, you only need $TM^2$ work instead of $M^T$. You only need the sum of the cardinalities of the clique's cardinality. Graphical models and bayes nets are the classical ways to learn multi-class and structured output problems, but these methods are not discriminative (they learn $p(x, y)$ instead of $p(y | x)$).

##### Doing it quickly

To do structured prediction, we need to be able to quickly compute $\mathrm{arg}\max_y \theta^T f(x, y)$ and $\sum_y p(y | x) f(x, y)$ - try to find a $y$ without searching through all possible $y$s and also find the average feature (the gradient) quickly. We can break up the space of $y$ by assuming it splits into many conditionally independent sums.

#### Structured prediction and HMMs

In a hidden markov model, we have a chain of a hidden variable $y$'s values which output observed values of the variable $x$. The space of $y$ is huge so we'd like to treat it as a graphical model - we need to be able to compute $p(x_1, \ldots x_T)$ given $x_1, \ldots x_T$, $y_1, \ldots y_T$ given $x_1, \ldots x_T$, and $\theta$ given $x_1, \ldots x_T$ using the Baum-Welch, alpha-beta or Viterbi algorithm. We use the junction tree algorithm because it works on all graphical models.

##### Inference

Given some variables, can we predict others? Eg, given that a patient has headache and fever, do they have the flu? We are trying to predict $p(Y_q | Y_f)$, where $Y_q$ are queried variables and $Y_f$ are the “findings” variables. The classical way to do it is $\frac{p(Y_q, Y_f)}{p(Y_f)} = \frac{\sum_{Y_u} p(Y_q, Y_f , Y_u)}{\sum_{Y_u} \sum_{Y_q} P(Y_q, Y_f, Y_u)}$ but this involves summing over tons of variables which is impractical.

#### Junction trees

Early networks for solving this were logic nets, but they gave rise to inconsistencies based on which path you take. Instead we use Bayesian networks which assign probabilities to the network. Then, the giant probability table breaks up into lots of smaller tables. We can then use an undirected graphical model instead where instead of thinking of conditionals of nodes given their parents, we can break the graph up as positive valued tables (not strict probability functions). A directed graph can be converted to an undirected one using moralization. We can build a junction tree from the graphical model by connecting each of the cliques into a “mega node”. Then, given this clique tree, we can add separator nodes which add variables at the intersection of each clique. Junction trees can only be a tree (no loop) when the moral graph is triangularized by adding edges so that there are no cycles with four or more nodes in the graph. Finding the optimal triangulation is NP-hard so we use a suboptimal triangulation which has too many cliques. Junction trees must also satisfy the running intersection property - on the unique path which connects clique $V$ to clique $W$, all other cliques share nodes in $V \cap W$. We can use Kruskal's algorithm to achieve this property, which maximizes the cardinality of the separators. Given a valid junction tree, we can get the valid $V, \mu, \sigma$. The junction tree algorithm sends a message from one clique to another and updates the clique with that information (probability table), and updates the separator. The tables need to be updated so that when you fix over all variables except one for one table and all variables except the same one in another, they have to be equal. The junction tree algorithm ensures that this happens.

##### Junction tree argmax

The junction tree algorithm can be used to compute any “distributed” function - one which breaks up as a the sum of product of little functions over the input variables. Maximization also has this property; we can take the max over each individual function in the product. The functions in the product must all be positive.

#### Structured SVMs

Tons of classes.

##### Large-Margin SVM

A large margin SVN allows for some training error. Given a bunch of inputs $x_i$, we output true or false. With soft margin, you find the fattest possible line (margin) with some error. You then minimize the amount of error in the objective, using a weight on the sum of the error $C$. This gives a quadratic program. You can also work with maximizing the dual. If we make the SVM's classification boundary pass through the origin, then it's possible that we may not choose the best line. But, if we add a 1 to the end of our data vectors $x$ then it's equivalent to adding an offset to the line except that we are penalizing the size of the margin, so you still may not get the best line. This makes the constraints simpler.

##### Multi-class SVM

Instead of a binary choice, we view the problem as a list of all possible answers, a multi-class classification task. Each output is then one of the possible classes. We'd like to be able to handle problems with exponentially many classes. One approach would be to train a classification boundary for each class then choose the classifier with the highest score. In training, you minimize the sum of the sizes of all of the margins. For each example, we require that the score assigned for its correct class is greater than all other classes. This gives (number of data points - 1)*(number of labels) constraints (for now we assume no class imbalance). We can also include slack so that each class doesn't have to totally beat everyone else. Instead of solving for each margin, we can concatenate them into one long margin and replace each $x$ with $\phi(x, y = i)$ a matrix of all zeros rows except the $i$th column is $x$. Now, each constraint just takes the concatenated margin vector and multiplies it against $\phi(x_i, y_i)$. Originally we had $w_ix_i$, now we have $w^\top \phi(x_i, y_i)$. We can also add slack here.

##### Label loss

We can also change the cost depending on what the classification is because some misclassifications may be a bigger deal than others (tiger-lion vs tiger-house cat). These costs are user-supplied depending on which classification errors seem more or less important - we are forcing the margin to be artificially larger or smaller. These are typically symmetric. If we set the cost (artificial margin growth) for classifying something as itself to zero then we can iterate over all examples. This can help with class imbalance somewhat. The average error we're getting is bounded by the average amount of slack used. This gives a generic structural SVM which is slow because there are so many constraints.

##### Efficient structural SVM

If we reformulate the problem to only have a single slack variable, we add a ton more constraints where each requires that each pair of training data points are classified correctly with the appropriate slack. We calculate how well we did on the training data on average (across all labelings) and require that it's greater than the average cost minus the single slack variable. The requirement is that the score of the correct labeling must beat all other labelings on average. The original problem optimizes with $n(k-1)$ constraints, 1-slack has $k^n$ constraints. Having a single slack variable allows all of the constraints to mess up as much. Originally each individual labeling for each variable had to beat other labels, now the true labeling of the whole dataset must beat the all other labelings of the whole datasets.

Since there are so many labelings, this is not efficient, but if we are smart we note that only a few of the constraints are active - we look at a given labeling and add as constraints all of the incorrect labelings. The cutting plane algorithm guesses an answer, computes the labeling, and then adds only the constraint which it is violating the most and to the quadratic program. This constraint is the one which needs the most slack. If the constraints are added in a smart way, we may not need to solve $k^n$ constraints. There's a theorem which guarantees that we will never need $n^k$ constraints and bounds the number of constraints to a small number. The hope is that if you put in the most violated constraint, you will satisfy other constraints for free. The most violated constraint is found using the junction tree algorithm.

##### Joint feature map for trees

A weighted context free grammar learns weights for (for example) transforming sentences and words into different parts of speech. The score of each tree is the sum of its weights. The highest scoring tree gives a weight vector which helps in classification. You can initialize the efficient structural SVM approach with the maximum likelihood estimate, and then iterate to improve it. By allowing for some slack, you avoid making the model fit the data exactly.

##### Maximum Relative Margin

It's not just about the fattest line - it's about the size of the line relative to the radius of your data. You can get a better generalization guarantee because you're not assuming that all dimensions are equally important. You are instead looking at the dimensions where the data actually changes and trying to separate those. For maximum relative margin, you add a constraint that requires that the classification score is limited within a range. Otherwise it's the same as the original SVM calculation. Setting the range to be very large is just like an SVM, and as you make the boundary smaller the error typically goes down until the quadratic program can no longer be solved. This can be extended to structured SVM.

### Kernels for SVMs

To compute an SVM, you need to compute a graham matrix which stores the pairwise similarity $k$ between points. Kernels replace the vector dot products in the Gram matrix, which allows for nonlinear decision boundaries. However, in the most general sense $k$ can be any symmetric positive-semidefinite comparison between two objects, even though traditionally they take two vectors and output a scalar.

#### String Kernels

Given two strings, we want to generate a scalar indicating how similar they are. We can define features $\phi$ which count the number of times substrings of length 1 and 2 (for example) appear in the strings (length $26\times26 + 26$). Then the scalar for two strings is $\phi(X)^\top \phi(X^\prime)$. In order to avoid computing huge vectors when using substrings which are longer, dynamic programming can be used to compute the kernel efficiently.

#### Probability kernels

Given two probability distributions on $x$ given by $p = p(x | \theta)$, $p^\prime = p(x | \theta^\prime)$, we'd like to define a kernel between them. A kernel can be converted into a distance using the Euclidian formula. A good candidate for distance would be the KL-divergence but it's not symmetric. Probability distributions lie on some strange manifold and the KL-divergence is in some sense calculating the distance over that manifold between the distributions. So, we should approximate the KL divergence using something like the Mahalanobis distance, which linearly approximates the manifold at some point $\theta^\star$, taken for example at the ML point. This distance can be converted back to a kernel, where the probability distributions are replaced with vectors which look like the gradient of the log probability, which are multiplied with the Fisher information matrix. A lot of approximations have been made, so it may be better to just start with a different divergence - Jensen-Shannon or symmetrized KL - but these are still not metric distances. The Hellinger divergence is actually a distance and can give a kernel without any approximation. The Hellinger divergence gives the Battacharyya kernel, the integral of the product of square roots of distributions. This can be generalized to the probability product kernel where the square root is replaced by any power.

We can use an SVM on things which are not simply vectors - images, text, continuous probability distributions, etc. We want, given these objets, to be able to predict a label just like with vectors. One option is to discretize these objects into vectors in some way. However, you can generalize more than this and potentially get better performance because you can have the nonlinearity “built-in”. You can also go from a vector representation to a probability distribution by estimating the probability distribution from the given data. The hope is that the probability distributions will tell you more about the data and improve accuracy. We can then think of classifying a “bag” of vectors rather than vector by vector. You have to pick a parametric form/distribution family. Without enough points, the distribution can be hard to estimate. If we use Gaussian distributions and the mean is the data itself, then it's equivalent to an RBF kernel, except the variance of the Gaussian can vary. You can also use discrete distributions, or exponential family kernels. The kernel for exponential family functions are easy to compute and only depends on the convex cumulant-generating function $K(\theta)$.

##### Kernelized Gaussian Product

Computing the Gaussian of a bunch of vectors may not make sense in the original space, so we may want to embed it in another manifold and compute the Gaussian in that space. In the higher dimensional space we can compute a much more complicated Gaussian which better fits the data.

##### Mixture Product Kernels

Instead of modeling each bag as a single distribution, we can model it as a mixture of distributions. The kernel for a mixture is then the average (weighted by the mixing proportions) of the kernels between all of the possible pairings.

### Feature and Kernel Selection

If you're trying to diagnose some disease, there may be many different tests you can perform (a bunch of measurements). Some of the measurements may be more or less expensive. You'd like to diagnose the disease using only a few measurements which best predict the diagnosis.

#### Feature Selection

Feature selection focuses on the “interesting” dimensions of the data, which makes the cost of obtaining the data smaller and also reduces the complexity/storage of the data. You can also improve generalization by having fewer, better dimensions. A simple way to do this would be to throw away features and choose the feature set which gives the SVM with the largest margin (or relative margin). However, this would require an exponential number of tests because you need to consider all possible subsets of dimensions - instead we'd like to do it efficiently (and jointly) with SVM estimation.

##### Filtering

Filtering eliminates features before you even train the classifier (as a pre-processing). The Fisher Information Criterion can be used to compute a score for each feature, and you keep the feautres with the best score. The Fisher score calculates the difference of the mean of the features for each feature for the positive and negative example and divides by the sum of the variance for the positive and negatve examples. So, in one dimension, you are considering fitting the feature with a Gaussian and seeing how far apart their mean is and how skinny the spread is. It assumes the data is Gaussian distributed in each dimension, and it also only measures how linearly separable the data is - it could be thrown out even if you have a kernel that could separate the dimensions. Finally, it doesn't consider that sometimes features are redundant - it only considers one feature at a time.

To address this last issue, you can use a Pearson correlation coefficient which computes how correlated each pair of features are (one cell of the covariance matrix). This also relies on a Gaussian assumption on the data. To avoid the Gaussian assumption, you can use the Kolmogorov-Smirnov test which computes the cumulative density function over both classes then over each single class. The score is then the biggest gap between the cumulative density function for all of the data and the CDF for one of the classes. It also only compares one feature at a time.

##### Wrapping

Wrapping finds and eliminates features by evaluating the accuracy after you train the SVM - it puts the feature selection into the classification problem by classifying feature vectors multiplied by binary vectors which “select” features. Since more features usually improves training accuracy, you usually pre-specify the number of features you want to use. We want to optimize the margin and radius bound and minimize VC dimension while keeping classification error. You can compute the radius of your data through a quadratic program which tries to weight all of the datapoints (where the weights sum to one) to put them all as far away from the origin as possible. If you assume the feature selection vectors are continuous, you can minimize the VC dimension by taking the derivatives with respect to the margin and radius and moving towards the minimum (against the gradient of the kernel). You are doing continuous optimization over the switches (feature selector vector) even though they are binary. The gradient with respect to the kernel can be used to compute the gradient with respect to the radius and the gradient with respect to the inverse margin.

#### Kernel Selection

If we're given a set of possible kernels to use in an SVM, we'd like to pick which would work the best. Picking a single kernel would involve trying an SVM for each kernel, but picking a (weighted) linear combination of kernels would require training many SVMs. A simpler approach would be to compute the alignment of the kernels, which computes a normalized inner product of Gram matrices. If the alignment is high, the kernels are doing basically the same thing. A good kernel is one which aligns well with the label matrix $y y^\top$ - the best kernel could be the one whose Gram matrix aligns best with the label matrix. Equivalently you can maximize the kernel Gram matrix with the label matrix while constraining the kernel's self inner product to 1 and ensuring it is positive semidefinite.

### Semi-Supervised Learning

In semi-supervised learning, you are given both labeled and unlabeled data. Getting labeled data is often the most time-consuming part of the learning process, so the use of unlabeled data can be helpful. Performance as measured on this unlabeled data is called the transductive performance. Transduction finds a large margin region between data, respecting those labels it has.

#### Semi-Supervised SVM

Given some labeled training data $(x_1, y_1), \ldots (x_l, y_l)$ and lots of unlabeled data $x_{l+1}, \ldots x_n$, we essentially look for a large margin separation boundary which separates the labeled data and is also large-margin with respect to the other data. This gives rise to an objective function which includes unknown labels and is non convex.

##### S3VMlight

S3VMlight is a way to heuristically solve the non convex objective function. First, a regular SVM is trained using the labeled data. Then, this SVM is used to label the unlabeled points and the SVM is retrained with a low cost assigned to violating the constraint for the unlabeled example. Labels are then swapped for the unlabeled data such that the objective function decreases. Over iterations, the cost of violating the constraint for the unlabeled data is increased. The algorithm stops when no labels have been swapped. This is an annealing approach which converges in finite time (very slowly). Sometimes unlabeled data can hurt performance.

#### Generative Model

Given a small amount of training data, we want to fit a generative model (assuming all of the data is generated in some way). If we have unlabeled training data too, we can use it to better fit the generative model. With just unlabeled data, we need to maximize the log-likelihood over both the parameters and the labels. With both, we optimize over a weighted combination of the log likelihood of the labeled and unlabeled data. However, we still may converge to a local minima and we have to pick a model to f it to.

#### Graph-based method

A graph-based method uses a similar assumption to a locally linear embedding - similar instances should have the same label. We want to minimize some loss function of the predicted label and the actual label plus a term to control the model complexity - for every combination of points we want to minimize a loss on their labeling weighted by some amount corresponding to how similar they are. The term to control the model complexity is called the graph regularizer. The similarity matrix must be constructed in some way (some kernel).

Given iid samples from an unknown distribution $p(x, y)$ over $x \in \Omega$ and $y \in Z$ organized as the labeled data $X_L \union Y_L = \{(x_1, y_1, \ldots (x_l, y_l)\}$ and the unlabeled data $X_U = \{x_{l+1}, \ldots x_{l+u} \}$, we'd like to approximate the missing labels that largely agree with the true labels for $X_U$. Graph-based semi-supervised learning constructs a sparse weighted graph from $X_L \union X_U$ which is used via some technique to generate an approximation for $Y_U$.

##### Graph Construction

Given $n$ data points, we construct an adjacency matrix $A$ using some kernel to denote which data points are connected and what their weights are. We can also compute a distance matrix $D$ from the kernels. The matrix ($D$ or $A$) is then pruned by producing a matrix $P$ which is multiplied against the adjacency matrix to produce a sparse graph. One way to produce $P$ is to set $P_{ij} = 0$ whenever $D_ij \le \epsilon$ but this can produce disconnected graphs. We can also use a symmetrized k-nearest neighbor where each point chooses its $k$ closest points, but this can yield a fully connected graph. Instead we can choose the minimal set of closest points but this can yield points with no neighbors. Instead we can use a b-matching graph where each node must have $b$ neighbors and we minimize the amount of distance between neighbors. Instead of symmetrizing with the max, we add a symmetric constraint on the optimization. This can be solved somewhat quickly. To get weights, we can use an RBF (Gaussian) kernel weighting on the distance, or set the weights so that each point is reconstructed by its neighbors. Given the weighted graph, we can use some kind of contagion to label the unlabeled data points based on what their neighbors are labeled.