User Tools

Site Tools


STAT W4400 Course Notes

Maximum Likelihood

Gaussian Distributions

A Gaussian distribution in one dimension is defined as $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \mathrm{exp}\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. $\mu$ denotes the mean, which specifies the location of the maximum of the function. $\sigma$ is the standard deviation which specifies the “spread” of the Gaussian. As a probability distribution, the value of $g$ denotes the probability that a Gaussian random variable will take on some value. Computing $(x - \mu)/\sigma$ measures the deviation of $x$ from its expected value in units of $\sigma$. In more dimensions, $x$ and $\mu$ are vectors, and the distribution becomes $g(x, \mu, \Sigma) = \frac{1}{\sqrt{2\pi}\mathrm{det}(\Sigma)} \mathrm{exp}\left( -\frac{1}{2} \langle (x - \mu), \Sigma^{-1}(x - \mu) \rangle \right)$. The matrix $\Sigma$ controls the “elongation” of the bump in all dimensions.

Parametric Models

If we observe datapoints and assume they are distributed according to some distribution, we would like to be able to estimate the parameters of the model. A model $\mathcal{P}$ is a set of probability distributions, indexed by parameter $\theta$ from the parameter space $\mathcal{T}$ which specify a single distribution from the model: $\mathcal{P} = \{P_\theta | \theta \in \mathcal{T}\}$. For a Gaussian, the parameters are the mean and standard deviation. A model is parametric if the number of parameters is finite and independent of the number of datapoints. We can write a parametric model as a density function $p(x | \theta)$, the probability of a datapoint given the parameters of the model.

Maximum Likelihood Estimation

If we have a set of datapoints $x_1, \ldots, x_n$ and a parametric model $\mathcal{P} = \{p(x|\theta) | \theta \in \mathcal{T}\}$ we'd like to find the parameter $\theta$ which indexes the distribution which is the “best” in some sense. Maximum likelihood looks for the distribution under which the data was most likely to occur - that is, the distribution which assigns the highest probability to the data: $\hat{\theta}_{ML} = \mathrm{arg}\max_{\theta \in \mathcal{T}} p(x_1, \ldots, x_n | \theta)$. This parameter is called the maximum likelihood estimator.


A common assumption for Maximum Likelihood Estimation is that the datapoints are independent and identically distributed - that is, they are separately drawn by randomly sampling the same distribution. We can then decompose the joint probability of all of the datapoints as $p(x_1, \ldots, x_n | \theta) = \prod\limits_{i = 1}^n p(x_i | \theta)$.

Analytical solution

A criteria for the maximum likelihood estimator is that $\nabla_\theta \left(\prod\limits_{i=1}^n p(x_i | \theta) \right) = 0$, that is, the derivative with respect to the parameters is 0 - we are at a maximum. A huge product rule computation is cumbersome, so we can compute the log probability to turn the product into a sum. Because the logarithm is monotonic, it doesn't change the location of the maximum. We only care about the location of the maximum, not the value, so we can safely take the logarithm of the joint probability and its maximum will still be at the same location: $\hat{\theta}_{ML} = \mathrm{arg}\max_\theta \sum\limits_{i=1}^n \log( p(x_i | \theta) )$. The analytical criteria for a maxima is then $0 = \sum\limits_{i = 1}^n \nabla_\theta \log( p(x_i | \theta) ) = \sum\limits_{i = 1}^n \frac{\nabla_\theta p(x_i | \theta)} {p(x_i | \theta)}$.

Example: Multivariate Gaussian, fixed covariance

Our model is the set of all Gaussians in $\mathbb{R}^d$ with fixed covariance $\Sigma$, $\mathcal{P} = \{g(x | \mu, \Sigma) | \mu \in \mathbb{R}^d \}$. Our parameters are just values of $\mu$ in $\mathbb{R}^d$. We can then solve

\begin{align} 0 &= \sum\limits_{i = 1}^n \nabla_\mu \log( g(x_i | \mu, \Sigma)) \\ &= \sum\limits_{i = 1}^n \nabla_\mu \log \left( \frac{1}{\sqrt{2\pi}\mathrm{det}(\Sigma)} \mathrm{exp}\left( -\frac{1}{2} \langle (x - \mu), \Sigma^{-1}(x - \mu) \rangle \right) \right) \\ &= \sum\limits_{i = 1}^n \nabla_\mu \log \left( \frac{1}{\sqrt{2\pi}\mathrm{det}(\Sigma)} \right) + \nabla_\mu \log \left(\mathrm{exp}\left( -\frac{1}{2} \langle (x - \mu), \Sigma^{-1}(x - \mu) \rangle \right) \right) \\ &= \sum\limits_{i = 1}^n \nabla_\mu \left( -\frac{1}{2} \langle (x - \mu), \Sigma^{-1}(x - \mu) \rangle \right) \\ &= -\sum\limits_{i = 1}^n \Sigma^{-1}(x_i - \mu) \end{align}

Multiplying by $-\Sigma$ on both sides gives $0 = \sum\limits_{i = 1}^n (x_i - \mu)$ so our maximum likelihood parameter is $\hat{\mu}_{ML} = \frac{1}{n}\sum\limits_{i = 1}^n x_i$.

Example: Multivariate Gaussian

Now, our model is $\mathcal{P} = \{g(x | \mu, \Sigma) | \mu \in \mathbb{R}^d, \Sigma \in \Delta_d \}$ where $\Delta_d$ is the set of positive definite matrices in $\mathbb{R}^{d \times d}$. The parameter space is then $\mathcal{T} = \mathbb{R}^d \times \Delta_d$. From above, the maximum likelihood mean does not depend on $\Sigma$, so we can use it to estimate the maximum likelihood covariance. Solving $\sum\limits_{i = 1}^n \nabla_\Sigma \log \left(g(x_i | \hat{\mu}_{ML}, \Sigma)\right) = 0$ we have $\hat{\Sigma}_{ML} = \frac{1}{n} \sum\limits_{i = 1}^{n} (x_i - \hat{\mu}_{ML})(x_i - \hat{\mu}_{ML})^\top$.


Classification is a supervised learning problem where, given a set of training data with some labels, we would like to be able to predict labels for unseen testing data. For example, GMail classifies email messages as SPAM or Important using machine learning. More formally, we have datapoints $X = \{x_1, \ldots, x_n\} \in \mathbb{R}^d$ each of which belongs to one of $K$ categories, called classes. A label for some datapoint $x_i$ is denoted $y_i \in [K]$, giving $Y = \{y_1, \ldots, y_n\}$. We assume that the datapoints are characterized by some joint distribution of the datapoints $X$ and labels $Y$ with some density $p(x, y)$. The quantity $p(x|y = k)$ is the class-conditional distribution of the class $k$, which denotes the probability distribution of each group. The only information we are given about $p$ is a set of example measurements and labels called the training data, denoted $(\tilde{x}_1, \tilde{y}_2), \ldots, (\tilde{x}_n, \tilde{y}_n)$. A classifier is then a function which maps from a point in $\mathbb{R}^d$ (a measurement) to one of the labels. To create a classifier, we will first try to perform with high accuracy on the training data, and hope it generalizes well to any unseen data.

Simple Classifier: Gaussian Discriminant Analysis

Given training data with labels $y_i$, a simple classifier would fit a Gaussian distribution to the training data each class separately using maximum likelihood (this is the training step). Then, we can assign a label to any point according to which Gaussian assigns it the highest probability. If the possible classes are $\{-1, 1\}$, then we estimate the densities $g(x|\mu_{+1}, \Sigma_{+1})$ and $g(x|\mu_{-1}, \Sigma_{-1})$ and assign label $1$ when $g(x|\mu_{+1}, \Sigma_{+1}) > g(x|\mu_{-1}, \Sigma_{-1})$ and $-1$ otherwise. We can impose more constraints on the distributions to reduce the number of parameters. For example, a popular assumption is that $\Sigma_{+1} = \Sigma_{-1} = \Sigma$. This simplifies the decision-making rule to setting the label to $+1$ when $2(\mu_{-1} - \mu_{+1})\Sigma^{-1} x < \mu_{-1}^\top \Sigma^{-1} \mu_{-1} - \mu_{-1}^\top \Sigma^{-1} \mu_{+1}$; that is, a hyperplane. Otherwise, the decision boundary may be curved. This classifier assumes that the data is Gaussian-distrubuted in each class. Furthermore, usually the points which affect the class prediction most are those right on the class boundary, so classifiers usually try to estimate a good class decision boundary instead of a good class distribution.


To quantify how well our classifier is working, we typically define a loss function which computes a nonnegative scalar value (the loss) given the true class label and class output. If the classifier is denoted $f$, we can write the loss as $L(y, f(x))$. We expect that $L = 0$ when $f(x) = y$. One common loss function is the zero-one loss function $L^{0-1}(y, f(x))$ which is $1$ when $f(x) = y$ and $0$ when $f(x) \ne y$ - it just indicates a misclassification, to count errors.


The risk of a classifier is the expected value of the loss function. This is in part a way to cope with the fact that the loss may fluctuate randomly. We are more concerned with classifying correctly those points which we are likely to see. Specifically, we define the risk of the classifier $f$ as $R(f) = \mathbb{E}_p[L(y, f(x))] = \int L(y, f(x))p(x, y)dxdy = \sum\limits_{y = 1}^K \int L(y, f(x))p(x, y)dx$. However, we do not know the probability of the data $p(x, y)$ so we must estimate it using the training data. This gives us the empirical risk, $\hat{R}_n(f) = \frac{1}{n} \sum\limits_{i = 1}^n L(y_i, f(x_i))$. In this framework, the best classifier is the one which minimizes the risk, which depends on the chosen cost function.

Naive Bayes Classifier

If we know the probability of each label given a data point $p(y | x)$ then we can define a classifier as $f(x) = \mathrm{arg}\max_{y \in [K]} p(y|x)$. Bayes rule gives $p(y|x) = p(x|y)p(y)/p(x)$. Because $p(x)$ is independent of $y$ we can instead write $f(x) = \mathrm{arg}\max_{y \in [K]} p(x|y)p(y)$. Because $p(x, y) = p(y|x)p(x)$ we can write the risk for a zero-one loss as $R(f) = \sum\limits_{y = 1}^k \int L^{0-1}(y, f(x))p(y|x)p(x)dx = \int p(x) \left( \sum\limits_{y=1}^k L^{0-1}(y, f(x))p(y|x) \right) dx$. If we can minimize the quantity in the summation for every $x$, the entire integral will be minimized. We can instead maximize

\begin{align} & 1 - \sum\limits_{y=1}^k L^{0-1}(y, f(x))p(y|x)\\ =& \sum\limits_{y = 1}^k p(y | x) - \sum\limits_{y=1}^k L^{0-1}(y, f(x))p(y|x) \\ =& \sum\limits_{y = 1}^k\left(1 - L^{0-1}\left(y, f(x)\right)\right)p(y | x )\\ \end{align}

Note that $1 - L^{0-1}\left(y, f(x)\right)$ is just the opposite of $L^{0-1}$, which will only be $1$ for one of the possible labels. The summation then just becomes $p(y = k | x)$. So, the optimal classifier for $L^{0-1}$ is looking for $\mathrm{arg}\max_{[K]} p(y|x)$. We need to learn the distribution $p$. We can estimate it using maximum likelihood, but usually we need to make extra assumptions about the distribution. One very simple and common assumption is that the individual dimensions of a data vector are conditionally independent given the label $y$. Estimating the class prior $p(y)$ is simple - just count the number of times $y$ appears in the training data and divide it by the number of training examples. Estimating $p(x|y)$ takes more work. Under the naive bayes assumption, $p(x|y) = \prod\limits_{i = 1}^d p_j (x_i|y)$. In this setting, we simply estimate probability distributions for each dimension.

Example: SPAM Filtering

To filter SPAM from an inbox, we can treat it as a binary classification problem with labels “spam” or “email”. The datapoints $x$ are $d$ dimensional vectors, such that $x_j$ denotes the number of occurrences of word $j$ in a vocabulary of $d$ words. We seek a classifier $f$ which, given a vector $x$ will output a label $\{+1, -1\}$ which denotes whether the email is spam or not. We assume that the training samples and labels $(\tilde{x}_1, y_1), (\tilde{x}_2, y_2), \ldots, (\tilde{x}_n, y_n)$ are drawn i.i.d. from some distribution $p(x, y)$. In the Naive Bayes setting, we assume that the number of occurrences of one word is independent of the number of occurrences of any other word.

Linear Classification

Linear classifiers use hyperplanes to separate the training data into classes. A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension $(d-1)$ which contains the origin. Hyperplanes can be specified according to their normal vector $v_H$ (the vector perpendicular to the hyperplane) by $H = \{x \in \mathbb{R}^d | \langle x, v_H \rangle = 0\}$. We can compute the distance of a point $x$ to the hyperplane by projecting it onto the direction $v_H$ by $\langle x, v_H \rangle / \|v_H\|$. We can determine which side of a plane a point $x$ lies on by measuring its angle to the plane $\theta$ and computing $\cos(\theta)$. Hyperplanes can be shifted by some bias vector $w = c \cdot v_H$ to create an affine hyperplane $H_w$ given by $H_w = H + w$. In this setting, we can determine which side of the hyperplane a point $x$ lies on by $\mathrm{sgn}(\langle x - w, v_H\rangle ) = \mathrm{sgn}(\langle x, v_H - c\|v_H\|^2)$. Typically we use a unit vector for the normal vector so that $\|v_H\| = 1$. Our affine hyperplane is therefore specified by $v_H \in \mathbb{R}^d$ and $c \in \mathbb{R}$. We can instead parametrize the hyerplane as $z = (-c, v_H) \in \mathbb{R}^{d + 1}$ and compute $\mathrm{sgn}(\langle x, v_H \rangle - c) = \mathrm{sgn}\left(\langle \binom{1}{x}, z \rangle \right)$.

In linear classification, we attempt to construct an affine hyperplane such that $f_H(x) = \mathrm{sgn}(\langle v_H, x\rangle - c)$, that is, the classification function checks which side of the hyperplane the datapoint falls on. This assumes that our training data is linearly separable by class. To learn a good linear classifier, we attempt to minimize the risk over the space $\mathcal{H}$ of all hyperplanes in $\mathbb{R}^d$. We then choose $f = \mathrm{arg}\min_{f \in \mathcal{H}} R(f) = \mathrm{arg}\min_{f \in \mathcal{H}} \mathbb{E}_p [L(y, f(x))]$. But, we can't compute this quantity unless we know the distribution $p$. We can approximate the risk and use empirical risk minimization $f = \mathrm{arg}\min_{f \in \mathcal{H}} \frac{1}{n}\sum\limits_{i = 1}^n L(y_i, f(x_i))$. If we use 0-1 loss, we are minimizing the number of errors on the training data. The solution region is the set of vectors $z$ for which the training error is zero. If the data is linearly separable, the solution region is a cone in $\mathbb{R}^{d + 1}$; otherwise it's empty.

Perceptron Algorithm

If we minimize the empirical risk with respect to the $z$ with 0-1 loss it will be piecewise constant. This is hard to minimize, it would be easier if our empirical risk was convex. In this case, we can use an algorithm (like gradient descent) which finds the unique minimizer. We can instead minimize an approximation of the empirical loss. We hope that we can find a cost function which we can substitute for the empirical loss which does not change the solution space. For example, the Perceptron is defined as $C_p(f) = \sum\limits_{i = 1}^n \mathbb{I}(f(\tilde{x}_i) \ne \tilde{y}_i) \left| \langle z, \binom{1}{\tilde{x}_i} \rangle \right|$. Minimizing the Perceptron cost function gives us a Perceptron classifier.

To minimize the cost function, we use gradient descent, an iterative algorithm which is defined by $z^{(k + 1)} = z^{(k)} - \alpha(k) \nabla C_f(z^{(k)})$ for iterate $k$. $\alpha(k)$ is the step size or learning rate which can depend on $k$; common choices are $\alpha(k) = 1$ or $\alpha(k) = 1/k$. The gradient of the Perceptron cost is given by $\nabla_z = \sum\limits_{i = 1}^{(k)} \mathbb{I}(f(\tilde{x}_i) \ne \tilde{y}_i)(-f(\tilde{x}_i))\binom{1}{\tilde{x}_i}$. If, for example, $\alpha(k) = 1$ and we are only misclassifying a single training point $\tilde{x}_i$, then gradient descent will set $z^{(k + 1)} = z^{(k)} - \binom{1}{\tilde{x}_i}$. Substituting with $z^{(k)} = \binom{c^{(k)}}{v_H^{(k)}}$ we have $v_H^{(k + 1)} = v_H^{(k)} - \tilde{x}_i$. This will effectively rotate the hyperplane to correctly classify $\tilde{x}_i$. The gradient descent algorithm will converge on the solution region.

Algorithms which use all of the training data in optimization are called “batch” algorithms. Instead of using the entire training set, we can use one sample at a time to update the hyperplane. Using the single-point update equation derived above, we can simply set $z^{(k + 1)} = z^{(k)} + \sum\limits_{i = 1}^n \tilde{y}_i \binom{1}{\tilde{x}_i}$, called the “fixed increment single-sample Perceptron”. If the data is linearly separable, this algorithm will terminate after a finite number of steps with a valid $z$.

Maximum Margin Classification

In a linear classification problem with two linearly separable classes, there may be many hyperplanes which separate the two classes. For the perceptron algorithm, the chosen hyperplane depends on the locations of the training points and the initialization. However, assuming that the points from each class come from some distribution, it may be that some hyperplanes are “better” in that they generalize better for unseen data points. If we assume a Gaussian distribution for each class, we can choose a hyperplane which evenly distributes the probability density on either side of it. However, this requires explicitly assuming a distribution for each class. A maximum margin classifier instead assumes that the best hyperplane is the one which is the furthest away from all of the points in both sets, which requires no distribution assumption.

First, a maximum margin classifier substitutes the datapoints in each class with the data's convex hull (the smallest convex set which includes all the points). In this way, it ignores the effect of the points “inside” each set. More specifically, given the data set $C$ we write $\mathrm{conv}(C)$ for the convex hull of the data with extreme points $\{e_1, \ldots, e_m\}$, which are the vertices and elements from the original set $C$. Every point $x$ in a convex set (such as $\mathrm{conv}(C)$) can be represented as a convex combination of the extreme points, with some weights which sum to $1$: $x = \sum\limits_{i = 1}^m \alpha_i e_i, \sum\limits_{i = 1}^m \alpha_i = 1$. A hyperplane separates two points if and only if it separates their convex hulls. The distance between a point $x$ and a set $A$ is the Euclidian distance to the closest point in $A$, given by $d(x, A) = \min_{y \in A} \|x - y\|_2$. If $A$ is a hyperplane, then the closest point will be the one perpendicular to $x$. The margin of a classifier hyperplane $H$ is the shortest distance between the plane and any point in either class's $X_1, X_2$ elements, given by $\min_{x \in X_1 \cup X_2} d(x, H)$, or equivalently, the smallest distance to either convex hull.

We can require that an affine hyperplane have a margin by requiring that $\langle v_H, x \rangle - c > 1$ for one class and $< -1$ for the other, rather than simply by looking at the sign of $\langle v_H, x \rangle - c$. The length of $v_H$, given by $\|v_H\|_2$, determines how large the margin actually is - it functions as our notion of length. We can then try to minimize $\|v_H\|_2$ (which increases the margin) while still linearly separating the training data. More specifically, given training points $(\tilde{x}_i, \tilde{y}_i)$ with $\tilde{y}_i \in \{-1, 1\}$, we solve $\min_{v_H, c} \|v_H\|_2 \; \mathrm{s.t.}\; \tilde{y}_i(\langle v_H, \tilde{x}_i \rangle - c) \ge 1, i = 1, \ldots, n$. When classifying, we still simply look at the sign of $\langle v_H, x \rangle - c$. A classifier which solves this problem is called a Support Vector Machine (SVM). The “support vectors” refer to the extreme points which are those that affect the choice of the hyperplane - the choice of hyperplane is very sensitive to the support vectors.

Solving the SVM training problem is difficult because there is a constraint which is a function. We can instead solve the dual problem, $$\max_{\alpha \in \mathbb{R}^n} W(\alpha) = \sum\limits_{i = 1}^n \alpha_i - \frac{1}{2}\sum\limits_{i, j = 1}^n \alpha_i\alpha_j\tilde{y}_i\tilde{y}_j\langle\tilde{x}_i, \tilde{x}_j\rangle \;\mathrm{s.t.}\; \sum\limits_{i = 1}^n \tilde{y}_i \alpha_i = 0, \alpha_i \ge 0, i = 1, \ldots, n$$ The ability to instead solve the dual problem follows from the fact that the closest distance between a point $x$ and a set $A$ is the maximum over the distances between $x$ and all hyperplanes which separate $x$ and $A$. In other words, we look at all hyperplanes tangent to the set and choose the one which maximizes the distance to $x$, which also happens to be the minimal distance between $x$ and $A$, or $d(x, A) = \mathrm{sup}\; d(x, H)$ for all hyperplanes $H$ tangent to $A$. So, we can find the maximum margin hyperplane by finding the points in each set closest to each other and choosing the hyplerplane orthogonal to the line joining them. Because we can represent any point in a convex set as a convex combination of its extreme points, we can try to minimize the distance between the two sets by $\min \|u - v\|_2^2$ where the minimization is over all points $u$ and $v$ in the convex hulls of the training sets for each class $X_1$ and $X_2$. Denote $C_i$ as the set of indices such that $\tilde{y}_i = 1$ and $C_2$ similarly for $\tilde{y}_i = -1$. As discussed above, we can represent each point $u$ and $v$ in $X_1$ and $X_2$ respectively as the sum of extreme points, writing $$u = \sum_{i \in C_1} \alpha_i \tilde{x}_i, v = \sum_{i \in C_2} \alpha_i \tilde{x}_i$$ Now, we can minimize the distance between the two sets as $$\min_{\alpha_1, \ldots \alpha_n} \|\sum_{i \in C_1} \alpha_i\tilde{x}_i - \sum_{i \in C_2} \alpha_i \tilde{x}_i\|_2^2 \;\mathrm{s.t.}\; \sum_{i \in C_1} \alpha_i = \sum_{i \in C_2} \alpha_i = 1, \alpha_i \ge 0$$ We can manipulate the objective as follows: \begin{align*} \|\sum_{i \in C_1} \alpha_i \tilde{x}_i - \sum_{i \in C_2} \alpha_i \tilde{x}_i \|_2 &= \|\sum_{i \in C_1} y_i\alpha_i\tilde{x}_i + \sum_{i \in C_2} y_i \alpha_i \tilde{x}_i \|_2^2\\ &= \|\sum_{i = 1}^n y_i\alpha_i\tilde{x}_i \|_2^2 \\ &= \left\langle \sum_{i = 1}^n y_i\alpha_i\tilde{x}_i, \sum_{i = 1}^n y_i\alpha_i\tilde{x}_i \right \rangle \\ &= \sum_{i, j = 1}^n y_i y_j \alpha_i \alpha_j \langle \tilde{x}_i, \tilde{x}_j \rangle \end{align*} Note that minimizing this objective is equivalent to maximizing the objective in the dual, up to the constant $\sum \alpha_i$ and the scale $1/2$. After solving the dual problem, we will be given a set of optimal scales $\alpha_i^\ast$. We can then compute $v_H^\ast = \sum\limits_{i = 1}^n \tilde{y}_i \alpha_i^\ast \tilde{x}_i$. We can find the optimal offset $c^\ast$ by finding the point in the training sets for each class which is closest to to the hyperplane $\langle v_H^\ast, x \rangle$. The optimal $c^\ast$ is then the mean of these two distances, computed as $$c^\ast = -(\max\limits_{i \in C_1} \langle v_H^\ast, \tilde{x}_i \rangle + \min\limits_{i \in C_2} \langle v_H^\ast, \tilde{x}_i \rangle)/2$$ For classification, we then simply compute $f(x) = \mathrm{sgn}(\langle v_H^\ast, x \rangle - c^\ast)$.


A soft margin classifier is a maximum margin classifier which allows for some points to disobey the margin or hyperplane boundary. This is motivated by the fact that if you have data which is not linearly separable, the maximum margin classifier will not converge. The traditional maximum margin classifier also is very dependent on the location of the support vectors. In other words, if we vary our training data only slightly, the solution will be different - so it is not robust, it is brittle. We use slack variables to permit points to be on the wrong side of the margin (or hyperplane) by adding a cost to the objective function. We replace the training rule with $\tilde{y}_i(\langle v_H, \tilde{x}_i \rangle - c) \ge 1 - \xi_i$, where $\xi_i$ is the slack variable for the $i$th datapoint. If the slack variables get larger than $1$, then the point is on the wrong side of the hyerplane. We include a cost for the slack in the objective function as $$\min_{v_H, c} \|v_H\|^2 + \gamma \sum_{i = 1}^n \xi^2 \;\mathrm{s.t.}\; \tilde{y}_i(\langle v_H, \tilde{x}_i \rangle - c) \ge 1 - \xi_i, \xi_i \ge 0, i = 1, \ldots, n$$ The parameter $\gamma$ determine how much it “costs” to allow for some slack. As $\gamma$ gets larger, the cost gets larger and it approaches the original SVM. $\gamma$ is usually set by cross-validation. The dual problem becomes $$\max_{\alpha \in \mathbb{R}^n} W(\alpha) = \sum\limits_{i = 1}^n \alpha_i - \frac{1}{2}\sum\limits_{i, j = 1}^n \alpha_i\alpha_j\tilde{y}_i\tilde{y}_j(\langle\tilde{x}_i, \tilde{x}_j\rangle + \frac{1}{\gamma}\mathbb{I}(i = j)) \;\mathrm{s.t.}\; \sum\limits_{i = 1}^n \tilde{y}_i \alpha_i = 0, \alpha_i \ge 0, i = 1, \ldots, n$$ The classification rule stays the same - slack just causes a different hyperplane to be estimated.

Optimization Methods

An optimization problem is formulated as trying to find a value $x$ which minimizes a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$. This minimization can be constrained - that is, only certain values of $x$ are permissible. The feasible set (the values of $x$ which satisfy the constraints) is often formulated using a bunch of equations or inequalities. We can write this form of optimization problems as $\min_x f(x) \;\mathrm{s.t.}\; g(x) \ge 0$.

When searching for a minimum of a function, two important types of minima are possible: Local or global. A local optimum is a point where the function increases in all directions around that point - it is the smallest point within some neighborhood around it. A global minimum is the smallest local minimum - there is no smaller point in the domain of $f$. If we have an algorithm to find a minimum of the function, we can't necessarily trust it - it may not be global. Analytically, we define a minimum of a function $f$ as a point where $\nabla f(x) = 0$ and $H_f(x)$ (the Hessian where $H_{i, j} = \frac{\partial f}{\partial x_i \partial x_j}(x)$ is positive semidefinite. The first condition requires that the point is a maximum or minimum; the second requirement requires that the point is a minimum. To find a minimum, we start with some point $x_0$ and find a sequence $x_0, \ldots, x_m$ such that $f(x_m)$ is minimum. At the $i$th step, we decide in which direction and how much to step depending on properties (gradient, Hessian) of the function at our current point $x_i$.

Convex Functions

A function is convex if any line between two points in the function's domain lies above the function. The function is strictly convex if the Hessian is positive definite for all $x$. If $f$ is convex, then any local minimum is a global minimum. So, $\nabla f(x) = 0$ is a sufficient condition for a global minimum. There may be multiple global minima if the function is not strictly convex.

Gradient Descent

Gradient descent is a numerical algorithm for optimization which chooses a step size based on the gradient of the function at the current estimate. The gradient of the function will always point away from the direction of decrease, so we step in the opposite direction of the gradient. We terminate when the gradient becomes small (indicating that we have reached a local minimum). Gradient descent is based on the assumption that if the derivative is large, you are far away from the minimum - because the step (which tries to reach the minimum) will be large when the gradient is large. We can write the $n$th step as $x_{n + 1} = x_n - \nabla f(x_n)$. Gradient Descent exhibits linear-time convergence to the solution.

Newton's Method

The original Newton's Method algorithm looks for roots of a function - that is, points where it is zero. It does so by $x_{n + 1} = x_n - f(x_n)/\nabla f(x_n)$. Newton's method can be used to get the value of a function at any point - for example, we can use Newton's method on the equation $x^2 - a = 0$ to compute the value of $\sqrt{a}$ because the value of $x$ which satisfies the equation is $\sqrt{a}$. We can use Newton's Method for minimization by using it on the first derivative - this will find a root of the first derivative, which corresponds to a minimum of the function. This involves the step $x_{n + 1} = x_n - H_f(x_n)^{-1}\nabla f(x_n)$. Newton's copes for some of the issues with gradient descent by normalizing the step by how quickly the derivative changes. However, it is only possible to use Newton's method when you can compute the Hessian's inverse, and the Hessian is only invertible when it is positive definite - $f$ has to be strictly convex. It is computationally expensive to compute the Hessian and its inverse, which you have to do at every step (even though the number of steps is constant across dimensionality). A pseudo-Newton method computes the Hessian once, which loses all of the nice guarantees of Newton but may help in some cases. Multiplying by the Hessian changes the slope and the direction of the step, which corrects based on the way the function curves near the current $x_n$. The effect of using the Newton method is quadratic convergence.


Usually, an optimization problem has constrants. Equality constraints are formulated as $\min f(x) \;\mathrm{s.t.}\; g(x) = 0$. This defines a feasible set of points - those that satisfy the constraint, such as $G = \{x | g(x) = 0\}$. If the function which describes the constraining is nice and smooth, then the set $G$ is also nice and smooth. We essentially only look at the values of $f(x)$ at the points $x$ which satisfy the constraints - the function $f_g(x)$ which is the intersection of the surface of $f$ with the surface defined by the constraint. Importantly, it may now be that the gradient of the function is non-zero at the minimum over the constraint set because the true minimum of the function may not be in the constraint set. However, the gradient is always orthogonal to contour lines (lines where the function does not change at all). So, to perform constrained optimization, we (orthogonally) project the gradient onto the constraint set which decomposes the gradient into a vector in the direction of (tangent to) the constraint set called $(\nabla f)_g$ and a vector orthogonal to the constraint set. The projection of the gradient onto the constraint set will be zero when $f$ is minimized over the constraint set. Essentially, we can think of the constraint set as a lower-dimensional space we are optimizing over. When the projection of the gradient onto this space is zero, we have found the constrained minimum. If we compute the gradient of $g$ at any point, it is orthogonal to the contour line. So, the projection of $\nabla f$ onto the feasible set will be zero when it points in the same direction as $\nabla g$. We can formulate this mathematically by requiring that $\nabla f + \lambda \nabla g = 0$ and $\nabla g = 0$.

Inequality constraints are expressed as $g(x) \ge 0$. We can decompose the feasible points as the points which are strictly less than $g$ and the points which satisfy $g(x) = 0$. If we are in the interior of $g(x) < 0$ then we are basically back in the unconstrained case. So, in the interior we can just require $\nabla f = 0$. On the boundary, as in the equality constraint case, we require that $\nabla f = \lambda \nabla g$. However, now the sign of $\lambda$ matters - if $f$ increases as it leaves the constraint set, it is a minimum; if it decreases as it leaves the constraint set then it's a maximum. By convention, we formulate this criterion for a minimum on the boundary as $\nabla f = -\lambda \nabla g \;\mathrm{s.t.}\; \lambda > 0$. Formally, we can combine these cases to require that $\nabla f = -\lambda \nabla g \;\mathrm{s.t.}\; g(x) \le 0, \lambda = 0 \;\mathrm{if}\; x \in \mathrm{in}(G), \lambda > 0 \;\mathrm{if}\; x \in \partial G$. We'd like to get rid of the case distinction, which we can do by instead require that $\lambda g(x) = 0$ and $\lambda \ge 0$ because we are either requiring that $\lambda = 0$ or $g(x) = 0$. So, our minimum criteria become $\nabla f(x) = -\lambda \nabla g(x) \;\mathrm{s.t.}\; \lambda g(x) = 0, g(x) \le 0, \lambda \ge 0$. The first two conditions form a system of $d + 1$ (potentially nonlinear) equations. These are called the KKT conditions. In summary, we have added two constraints and an extra variable in return for actually figuring out how to minimize $f(x)$ (which we weren't originally able to do). The KKT become both necessary and sufficient conditions when $G$ is convex. If the constraint set $G$ is nonconvex, there may be many points which satisfy the $\nabla f = -\lambda \nabla g$ condition.

Instead of minimizing the constrained optimization problem, we can minimize the function $f(x) + c\mathbb{I}_{[0, \infty)} (g(x))$ where $\mathbb{I}_{[0, \infty)}$ is a function which is $0$ when its argument is less than zero and infinite otherwise. So, in the disallowed region, the function becomes infinite. The problem is that $\mathbb{I}$ is a step function, so we can't apply a gradient algorithm. We can approximate the step function by a smooth function (with a slope), called a “barrier function”. An example of a barrier function is $\beta_t(x) = -\frac{1}{t} \log (-x)$. In this example, as $t$ gets large, we get closer and closer to $\mathbb{I}_{[0, \infty)}$. So, instead of minimizing $f(x)$ subject to $m$ constraints $g_i$ we minimize $f(x) + \sum\limits_{i = 1}^m \beta_i (x)$. As a general strategy for constrained optimization, you convert constraints into a solvable problem using Lagrange multipliers, then you relax the constraints using barrier functions, then you use some kind of descent method to solve the unconstrained optimization problem. This is called an interior point method.

Multi-Class Problems

In many problems of interest, there are more than two classes. Some architectures lend themselves well to this scenario, such as Gaussian Discriminant Analysis, trees, and ensemble methods. For hyperplane-based methods, it is not as clear how to generalize to more than two classes. Popular approaches include one-versus-one classification, one-versus-all (or -rest), and multiclass discriminants.


In one-versus-all classification, you train a separate classifier per classes, each of which determines whether the query point is or is not in some class $k$. You can then choose the correct class based on the decisions of each classifier. However, you can come up with a contradiction if two classifiers say the point is in their class. For one-versus-one classification, you train a separate classifier for each pair of classes (which results in many classifiers). Class is then chosen by majority vote. Again, you can have regions where the outputs of classifiers are contradictory. Often, the contradictory regions are either small or not problematic in practice, so these methods can work well. However, it is inherently a problematic technique.

Multiclass Discriminants

For some classifiers, we may be given some kind of confidence metric for a classification. We can use this to determine which class to choose in a multiclass problem. In a linear classifier, the usual rule is to classify based on the sign of $\langle x, v_H \rangle - c$. Instead, we can combine classifiers before computing signs and define $g_k(x) = \langle x, v_k \rangle - c_k$ for each classifier $k$. Now we can just pick the classifier with the largest value as $f(x) = \mathrm{arg}\min_k \{g_k(x)\}$. A larger value of $g_i$ indicates that the point lies “further” into the class. This doesn't work for SVMs because these “distances” are not comparable across classifiers. It's possible to combine the SVM classifiers for each class into a single optimization problem, but the training time scales quadratically in the number of classes. So, the SVM is usually used in a one-against-all formulation for multiclass problems.


So far, we have only considered a linear decision boundary. In some cases, the decision boundaries between classes is (highly) non-linear. We'd like to be able to take a linear classifier (like a Support Vector Machine) and turn it into a linear classifier. The SVM uses a scalar product as a measure of similarity between two points, and the distance from a point to a hyperplane. The scalar product is linear, so the SVM is linear. If we use a nonlinear function instead of the scalar product, we can make a nonlinear classifier. A scalar product can be regarded as a function of two vectors in $\mathbb{R}^d$ and maps them to a scalar. So, maybe we can use a nonlinear function $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ instead. When a function of this kind satisfies a set of conditions set out by Mercer's theorem, we call it a kernel. In practice, we do not construct a function and prove it is a kernel; instead, we use one of a set of well-studied kernels. A popular kernel is the the RBF kernel which is defined as $$ k_{RBF}(x, x^\prime) = \mathrm{exp}\left(-\frac{\|x - x^\prime\|_2^2}{2\sigma^2}\right) $$ for some $\sigma \in \mathbb{R}_+$. With $x^\prime$ fixed, the RBF kernel is (proportional to) a spherical Gaussian in $\mathbb{R}^d$ with mean $x^\prime$ and standard deviation $\sigma$.

Which functions are kernels?

We call a function $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ a kernel when there is some function $\phi : \mathbb{R}^d \rightarrow \mathcal{F}$ which maps into some space $\mathcal{F}$ which has a scalar product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ such that $k(x, x^\prime) = \langle \phi(x), \phi(x^\prime) \rangle_{\mathcal{F}}$. So, $k$ is a kernel when you can interpret it as a scalar product in some other space. When using the kernel, we are essentially using a classifier which is linear in the space $\mathcal{F}$. Usually, we want the data to be separable in $\mathcal{F}$; usually, this involves mapping the data into a higher dimensional space. For example, if points can be separated by a circle in two dimensional space, they can be separated with a hyperplane in three dimensional space. However, we don't want to have to look at the data to choose a kernel function - instead, we want to just choose a very high dimensional space $\mathcal{F}$. For an RBF kernel, the space $\mathcal{F}$ is infinite-dimensional. We don't ever want to actually compute $\phi(x)$ because it may be very complicated, and result in an infinite-dimensional vector. So, we can only use classifiers where the data $x$ appears within scalar products which we can replace with the kernel function.

Mercer's theorem states that if a function $k$ satisfies $$ \int_{\mathbb{R}^d \times \mathbb{R}^d} k(x, x^\prime) f(x)f(x^\prime)dx dx^\prime \ge 0 $$ (it is positive semidefinite) for any function $f$, then we can write it as $$ k(x, x^\prime) = \sum_{j = 1}^\infty \lambda_j \phi_j(x)\phi_j(x^\prime) $$ where $\lambda_j \ge 0$. The quantity on the right looks like a scalar product, except with a $\lambda_j$ term scaling each entry in the sum. Each $\phi_j$ is a function $\mathbb{R}^d \rightarrow \mathbb{R}$. We then call the (potentially infinite vector) $\phi(x) = (\sqrt{\lambda_1}\phi_1(x), \sqrt{\lambda_2}\phi_2(x), \ldots)$ a feature map. This is the function analog of the fact that you can take a positive semidefinite matrix and decompose it in eigenvalue-eigenvector form. Given two kernels $k_1, k_2$, we can construct new kernels as $k_1 + k_2$, $k_1 \cdot k_2$, $ck_1$, etc.

Example: SVM

For kernelized SVM, the training function becomes $$\min_{v_H, c} \|v_H\|^2_{\mathcal{F}} + \gamma \sum_{i = 1}^n \xi^2 \;\mathrm{s.t.}\; \tilde{y}_i(\langle v_H, \phi(\tilde{x}_i) \rangle - c) \ge 1 - \xi_i, \xi_i \ge 0, i = 1, \ldots, n$$ Note that $\phi$ appears alone, but after converting to the dual we get $$\max_{\alpha \in \mathbb{R}^n} W(\alpha) = \sum\limits_{i = 1}^n \alpha_i - \frac{1}{2}\sum\limits_{i, j = 1}^n \alpha_i\alpha_j\tilde{y}_i\tilde{y}_j(k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{\gamma}\mathbb{I}(i = j)) \;\mathrm{s.t.}\; \sum\limits_{i = 1}^n \tilde{y}_i \alpha_i = 0, \alpha_i \ge 0, i = 1, \ldots, n$$ So, we can still train the SVM because we have just replaced the scalar products in the dual with kernels. The classification becomes $$ f(x) = \mathrm{sgn} \left(\sum_{i = 1}^n \tilde{y}_i \alpha_i^\star k(\tilde{x}_i, x ) - c \right) $$ For an RBF kernel, we are essentially placing a gaussian-shaped function at each support vector, which add up to the decision surface. The decision line is the boundary where the surfaces for each class have the same value. The smoothness of the decision surface is controlled by the $\sigma$ parameter, which makes the Gaussians spread out or contract around each support vector. Note that adding a kernel adds another parameter we should choose using cross-validation.

Example: Graph Kernels

A graph is a set of vertices and edges. For every pair $i, j$ of vertices, there is an edge variable $e_{i, j}$ which is $1$ when there is an edge and $0$ when there isn't. The graph is undirected when $e_{ij} = e_{ji}$ - the vertices are either connected or not; they are not only connected in one direction. We call the set of undirected graphs of finite size $\mathcal{G}$. Say we have a set of training graphs and labels, denoted $(\tilde{G}_i, \tilde{y}_i)$ where $G_i \in \mathcal{G}$. We'd like to be able to label unlabeled query graphs based on this training data. We can characterize a graph according to how often certain patterns appear in it. For example, we might be interested in how many times each possible subgraph of size three shows up in the graph. We can then define a feature map which counts the number of times each pattern (for a subgraph of size $F$) appears, and divides it by the number of subgraphs of size $F$. This maps a graph $G$ to $\phi(G) \in \mathbb{R}_+^d$ where $d$ is the number of patterns we are interested in. We can then define a kernel which takes two graphs $G, G^\prime$ and computes the inner product $\langle \phi(G), \phi(G^\prime) \rangle$. We are therefore regarding graphs as similar when there are patterns that appear often in both of them. We can now train a (kernelized) classification as usual.

Model Selection

Model selection tries to choose the best model out of a candidate list of models by some criteria. Usually, the way to determine “best” is predictive performance. A model can just be a set of different parameter values for some class of models, indexed by parameter setting. For example, the SVM is a family of models, indexed by the margin cost parameter $\gamma$ and the kernel parameter(s) $\sigma$. Model selection will choose the setting of the parameters (call them $\gamma$) which gives the “best” model. Including an optimization over the different possible models will result in overfitting, such that the classifier adapts too well to the given sample data. We'd instead like for our model to generalize well to new data, as opposed to just modeling all the idiosyncrasies of the training data. We can observe the phenomenon of overfitting when our test error begins to increase even as the training error decreases.


When performing cross-validation, we (randomly) split the dataset into train, validation, and test sets. The test data is put aside and is not used for anything except evaluating the model once it is trained, the model is chosen, and all parameters are fixed. We can then treat the accuracy on the test set as our expected accuracy on real-world data, because we are treating it as unseen during the training and model selection process. The training data is used as part of the training algorithm. We can then use the validation set to evaluate the models for different model parameters $\gamma$. The parameter settings which give the best validation error are chosen for the final model. We are measuring model quality using risk, or empirical risk (because we don't know true risk). Usually hyperparameters are those which determine the complexity of the model. Formally, for each parameter setting $\gamma_1, \ldots, \gamma_m$ we use the training data to create a classifier $f(y | x, \gamma_i)$ and compute its empirical risk over the validation set. We then choose $\gamma^\ast$ which minimizes the empirical risk over the validation set. Finally, we compute the error on the test data using the best parameter setting and report that as our expected model performance.

We must determine how to split up the data we are given into training, validation and test sets. A common technique is k-fold cross validation, where you first separate out some proportion of the data as the test set then split the rest into $k$ blocks. You then perform cross-validation $k$ times, where one of the blocks is used as the validation set each time. You are thus evaluating each parameter setting $\gamma_i$ $k$ times. The empirical risk for the parameter setting $\gamma_i$ is then set to the average risk across all folds, and the best model is the one whose average risk across folds is minimized. When $K$ is set to the number of elements in our dataset, we call it leave-one-out-cross-validation, which requires training many many classifiers. More fundamentally, if we remove very little data from the training set, we are not exposing much of the datasets variation. Exposing the variance will prevent overfitting. However, classifiers will perform better if you provide them with more training data, so it may be useful to use a larger $K$ (and thus a larger training set). So, for a small $K$, we may only train bad classifiers and end up regularizing too much. In practice, $K$ is chosen to be $5$ or $10$, which tries to avoid weakening the classifier while still measuring variance caused by different training/validation data.

Tree Classifiers

All classifiers we're considering perform classification according to location in $\mathbb{R}^d$. Liner classifiers use hyperplanes, dividing $\mathbb{R}^d$ into two half-spaces. We can be even more primitive by restricting ourselves to hyperplanes parallel to some axis. We can specify this hyperplane by the axis it intersects and the threshold value. If you think of the vectors in $\mathbb{R}^d$ as a list of attributes or features, we are essentially classifying based solely on the value of one of the features. We can arrange these very simple classifiers into a tree structure to create a tree classifier, which makes a set of decisions based on past decisions, all according to thresholds on individual axes in $\mathbb{R}^d$. For example, we may first test the value of some axis, then based on the outcome we perform another test, and another test, etc. until we reach the bottom of the tree and receive a class label. This effectively divides the space into smaller subspaces with perpendicular boundaries. The hierarchy of the simple decision makers (stumps) determines how the space is split up. We simply classify according to the region that a test point falls into. This is not inherently a two-class procedure.

To train a tree classifier, we need to decide where to place the split points, which region gets assigned to each class, and how many regions we should use. In general, we assign each region to a class according to how many of the training points from each class fall into each region. To determine where to place the splits, we compute a cost as the proportion of training points which get misclassified. Given a tree depth $D$, we iteratively choose the best split for each region. At each step, we have a set of currently defined regions; we choose splits in each region to minimize the risk until we have $D$ splits. The tree depth $D$ results in $2^D$ regions and is thus a complexity parameter; it should be chosen with model selection (cross-validation, “complexity pruning”). We can also choose a split as the split across all current leaves which improves accuracy the most.

Ensemble Methods

To motivate, consider a hypothetical classifier for a two-class problem which just assigns classes at random. This classifier has an expected error of 50%. One way to create such a classifier would be to randomly choose a hyperplane within the data and use it as a linear classifier. If we create $m$ of such “random hyperplane” classifiers, we can classify a point by majority vote, which would still yield a 50% expected error. If we modify each of the hyperplanes to make them a little bit better than 50%, our expected error will also go up a little bit. In the two-class case with $m$ classifiers each with prediction $f_1, \ldots, f_m$ the majority vote classifier assigns $\mathrm{sgn}\left(\sum\limits_{j = 1}^m f_j\right)$. This is an ensemble method - a method which aggregates many classifiers to make a single decision. Usually, classifiers are only required to perform better than 50% (a weak learner).

Condocet's Rule

To decide whether an ensemble method is a good idea, we assume that each classifier outputs the correct label with the same probability $p \in [0, 1]$ and that each classifier outputs its label independently. The probability that the majority vote makes the correct decision, by Condorcet's jury theorem, is $\sum\limits_{j = (m+1)/2}^m \frac{m!}{j!(m-j)!}p^j (1-p)^{m-j}$. When $p > .5$, as the number of votes increase the accuracy increases; for $p < .5$ it decreases. When $p$ is close to $1$, it does not take many votes to get very close to 100% accuracy. So, if we have a set of random and independent weak learners the accuracy will increase as we add more weak learners. However, the weak learners are not independent because they are trained on the same dataset. Different ensemble methods use different strategies to train and combine weak learners so that they behave relatively independently.


Boosting works by adjusting weighting values for each data point, and thus requires the ability to train a weak learner on a weighted training set (such as trees or perceptrons). In boosting, all data points first are given some weight. We then iteratively add weak learners one at a time. After a weak learner is added, training points which are correctly classified are weighted down. We then train another weak learner where additional penalty is imposed for incorrectly classifying points with higher weights. In effect, boosting tries at each iteration to create a classifier which gets the points right that the previous classifier got wrong. For Adaboost, the observation weights $w_i$ are initialized to be uniform. Then, $M$ weak learners are trained one-by-one. The error for the $m$th classifier (with output $g_m(x)$) is computed as $$ \mathrm{err}_m = \frac{\sum\limits_{i = 1}^n w_i \mathbb{I}\{y_i \ne g_m(x_i)\}}{\sum\limits_{i = 1}^n w_i} $$ and the coefficient $\alpha_m = \log( (1 - \mathrm{err}_m)/\mathrm{err}_m )$. The weights are then updated as $w_i = w_i \mathrm{exp}(\alpha_m \mathbb{I}(y_i \ne g_m(x_i)))$, and the next classifier is trained with the new weights. Once trained, the aggregate Adaboost classifier assigns a label according to $f(x) = \mathrm{sgn}\left(\sum\limits_{m = 1}^M \alpha_m g_m(x) \right)$. Note that the alpha coefficients weight the classifiers according to how useful the classifier is. When computing $\alpha_m$ using a $\log$ and then exponentiating it when computing $w_i$, we are essentially setting $w_i^{(m)}$ to $w_i^{(m-1)}$ when $g_m$ classifies $x_i$ correctly and $w_i^{(m-1)}\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$ when $g_m$ classifies $x_i$ incorrectly.

Adaboost is very popular because it works “out of the box”, can create nonlinear decision boundaries, and can handle multiple classes as long as the underlying weak learner can. An additional nice property is that Adaboost keeps training even after training error has reached a minimum (which is where most classifiers stop). This is because it continues to train on newly re-weighted training data. In practice, this often means that it continues to reduce testing error. Adding additional classifiers with re-weighted training data is sort of like moving towards the maximal-margin solution. This helps improve generalization strategies. An additional application arises when you use Adaboost with decision stumps - you can treat the $\alpha_m$ parameter from each classifier as an indication of how useful that stump is, and therefore how useful that feature dimension is. This can work as a form of feature selection - choosing which features are important for separating the classes. An unusual property of Adaboost is that in some cases it can generate exactly the same set of weights that it has already generated, getting stuck in a “cycle”.

A boosting classifier can also be written as $f(x) = \mathrm{sgn}(F(x))$ where $F(x) = \sum\limits_{m = 1}^M \alpha_m g_m(x)$, which is a linear combination of weak learners. If the weak learners are viewed as elements of a basis, their linear combination is a representation of $F(x)$ in this basis. This linear combination is then a linear approximation of the decision boundary using a basis of weak learners. The approximation is determined by the coefficients $\alpha_m$. Each $\alpha_m$ is obtained by minimizing the risk $(\alpha_m, g_m) = \mathrm{arg} \min_{\alpha^\prime_m, g^\prime_m} R_n(F^{m-1} + \alpha^\prime_m g^\prime_m)$, where $F^{m-1}$ is the decision boundary approximation at the previous iteration. The loss function for this risk which yields Adaboost is the exponential loss, $L^{exp}(y, f(x)) = \mathrm{exp}(-y f(x))$. The quantity $y f(x)$ is $1$ when $x$ is correctly classified, and $-1$ when it's misclassified. The difference then between the exponential loss and the 0-1 loss is that when classifying correctly, exponential loss still contributes, and it's also differentiable. Minimizing this $R_n(F^{m-1}(x) + \beta_m g_m(x))$ involves choosing the classifier $g_m$ which minimizes the weighted misclassification rate, setting $\beta_m = \frac{1}{2}\log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} = \frac{1}{2}\alpha_m$, and $w_i^{(m + 1)} = w_i^{(m)}\mathrm{exp}(-y_i \beta_m g_m(x_i))$ which yields the Adaboost classifier. In other words, Adaboost approximates the bayes-optimal classifier under exponential loss using a basis of weak learners, each of which optimizes 0-1 loss on weighted data. In some sense, the exponential loss can be seen as a smooth approximation of the 0-1 loss.

Application: Face Detection

In face detection, patches in an image are classified as either containing a face or not, so that the locations of all faces in the image can be found. In this setting, usually there are many more examples of “not face” than there are positive examples (most of the pixels in most images are not faces). This causes a class imbalance, where a classifier could achieve high accuracy just by classifying everything as negative. In order to cope with class imbalance, we need to change the classifier's cost function from 0-1 loss (or accuracy). However, if we just make false negatives more expensive, we risk increasing the number of false positives. A common approach is instead to use many classifiers linked in a chain structure, where each classifier uses an imbalanced loss where each classifier is trained only with the points which were not correctly classified as negative by the previous classifier. The training set for each classifier is then the points labeled as positive and the points incorrectly labeled as negative by the previous classifier. At each step, we want to make the proportion of false positives small and the number of true positives large. When classifying a point, each classifier is then saying “this point is definitely negative” or “this point may be positive”. The following classifier can then either confirm that “this point may be positive” or decide that it was in fact negative. At the end of the cascade, if all classifiers have called a point positive, we can be confident that the point is a positive example.

Face detectors are often run on digital cameras, so the classification step must be computationally efficient, while the training step can take a long time. The features extracted are window features, which compute the difference of the average of pixels in one half of a rectangular window and the average of the pixels in the other half. These window features are extracted across a variety of window shapes, and splits into different halves. The window features are extracted for all windows in the image, for a variety of window sizes. The resulting feature dimension is about 160,000. A boosting classifier is used, trained with decision stumps, to be used for feature selection. A cascade classifier is then trained, which requires explicit setting of the desired false positive rate and detection rate at each set. At each cascade level, a boosting classifier is trained, where we add features (according to our feature selection process) until the desired false positive and detection rates are achieves. At each layer, we train using all positive examples and all misclassified negative examples. This decreases the size of the background dataset from step to step. Overall accuracy can be improved by training multiple (three) classifiers with slightly differing parameter (false positive and detection rates) settings and using a voting scheme.


Bootstrapping attempts to improve the estimator quality by resampling the empirical distribution. The basic bootstrapping algorithm takes a sample $\tilde{x}_1, \ldots, \tilde{x}_n$ and an estimator $\hat{S}$ for some statistic (a function of the actual distribution - mean, variance, etc) $S$. A bootstrap algorithm generates $B$ bootstrap samples $\mathcal{B}_1, \ldots, \mathcal{B}_B$ by sampling $n$ times with replacement from the data sample - thus creating $B$ new datasets, each of the same size $n$ as the original, except that some points may appear multiple times in any $\mathcal{B}_b$. We then estimate $S$ using each of these datasets. The bootstrap estimate of the statistic is then computed by averaging over all bootstrap samples, $\hat{S}_{BS} = \frac{1}{B}\sum\limits_{b = 1}^B \hat{S}_b$. For example, mean and variance are typically estimated by $\hat{\mu} = \frac{1}{n} \sum\limits_{i = 1}^n \tilde{x}_i$ and $\hat{\sigma}^2 = \frac{1}{n} \sum\limits_{i = 1}^n (\tilde{x}_i - \hat{\mu})^2$. The bootstrap estimator computes $\hat{\mu}_b = \frac{1}{n} \sum\limits_{i = 1}^n \tilde{x}_i^{(b)}$ and $\hat{\sigma}^2 = \frac{1}{n} \sum\limits_{i = 1}^n (\tilde{x}_i^{(b)} - \hat{\mu}_b)^2$ for each bootstrap sample $b$. The probability that some point $x_i$ does not appear in some bootstrap after $n$ samples is $(1 - 1/n)^n$, which, as $n \rightarrow \infty$, approaches $1/e \approx 0.3679$. So, as $n$ gets large, any $\tilde{x}_i$ will appear in about $65\%$ of the bootstrap samples. Each $\tilde{x}_i$ may occur multiple times in some samples, and the expected number of times each $\tilde{x}_i$ occurs is $B$. In a vague sense, on average, the boot strap samples become more “typical”.


Instead of re-weighting to achieve high-variance weak learners as in boosting, we will randomly distort the data set with bootstrapping; training each weak learner on the resampled training sets. This results in “bootstrap aggregation” - bagging. In bagging, we use vectors instead of scalars for the class labels, where $y_i$ is a vector which is all $0$s except for the $k$th entry corresponding to $\tilde{x}_i$ being in class $k$. That way, we can compute the proportion of classes in the dataset by averaging over all $y_i$. To train a bagging classifier, we first draw a bootstrap sample $B_b$ and train a classifier $f_b$ on $B_b$. To classify, we average over bootstrap samples, and compute $f_{avg}(x) = \frac{1}{B}\sum\limits_{b = 1}^B f_b(x)$. Each $f_b$ will output a vector in the form of $y_i$ as described above. We can choose a class by just choosing the index corresponding to the largest value of the predicted label vector, which is the class label that most of the classifiers have voted for. When, for example, using bagging on trees to classify two Gaussian distributions, the resulting structure can vary substantially between bootstrap samples. Boosting seems to be “better” at decorrelating the trees.

Random Forests

Bagging attempts to further reduce correlation between the trees generated using a bagging technique. Tree training typically splits over all dimensions; random forests instead choose a subset of the possible dimensions at each split and only optimizes over those. The subsets are chosen at random out of all dimensions for each tree, at each split. This additionally randomizes the trees generated. In some sense, you are not only randomizing the dataset (what bagging does); you are also randomizing along the dimension of the space. Specifically, given an input parameter $m$ (the number of dimensions to be used at each step), for each $b \in 1, \ldots B$ you draw a bootstrap sample $\mathcal{B}_b$ and train a tree classifier $f_b$ on this dataset, where each split selects $m$ axes from $\mathbb{R}^d$ at random and finds the best split $(j^\ast, t^\ast)$ on this subset of dimensions and splits on axis $j^\ast$ at $t^\ast$. For classification, you use a majority vote as in bagging. Usually $m$ is chosen as $\sqrt{d}$ or smaller; it can be chosen using cross-validation. Random forests are often about as good as boosting. They are favored because they require very little hand-tuning of parameters. Random forests tend to have a bias towards axis-parallel decision boundaries.


In regression, we again have measurements in some $d$-dimensional space, but now we are trying to produce real values, that is $y \in \mathbb{R}$. So, we are trying to produce a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ such that $f(x) \approx y$ for $(x, y)$. We call $f$ a regression function. Given an unseen datapoint, we want to be able to predict what we would expect the corresponding value of $y$ to be, based on some training data. In classification, the points $y$ are just categorical (integer), rather than being real-valued, so classification is a special case of regression where the regression function is piecewise-constant.

Linear Regression

A regressor is called linear when the function $f$ is linear (or, in practice, affine). In two-dimensional space, we are attempting to compute a “line of best fit” for the data - a linear function which approximates the dependence between $x$ and $y$. We thus need to define some notion of what a good “fit” is; some notion of error. Linear (affine) functions are always of the form $f(x) = \beta_0 + \sum\limits_{j = 1}^d \beta_j x_j$ where each $\beta_j \in \mathbb{R}$. The coefficient $\beta_0$ sets the “offset” of the hyperplane, and each $\beta_j$ multiples each axis of the data to produce the resulting $y$. So, to find $f$ we just need to find $\beta \in \mathbb{R}^{d + 1}$.

Least-squares regression

Least-squares regression is a linear regression technique which uses squared Euclidian distance to measure loss of a classifier. Specifically, the squared error loss is given by $L^{se}(y, f(x)) = \|y - f(x)\|_2^2$. To find the optimal $\beta$ which optimizes the squared-error loss, we minimize the squared error loss over the whole training data set $(\tilde{x_i}, \tilde{y_i})$, so that $\hat{\beta} = \mathrm{arg}\min_\beta \sum\limits_{i = 1}^n L^{se}(\tilde{y_i}, f(\tilde{x}_i; \beta))$ where $f(x; \beta) = \beta_0 + \sum\limits_{j = 1}^d \beta_j x_j = \langle \beta, (1, x) \rangle$. We can instead write this in matrix form, where each row of the matrix is one data point so that $\tilde{X}$ is the matrix whose $i$th row is $[1, (\tilde{x_i})_1, \ldots (\tilde{x_i})_d]$. We can write $\tilde{X}_j^{col}$ as the $j$th column of $\tilde{X}$, where typically we call the first column (of all $1$s) as $\tilde{X}_0^{col}$. Then, when computing $\tilde{X} \beta$ we get a column vector whose $i$th entry is $f(\tilde{x}_i; \beta)$. In matrix form, solving the least-squares linear regression problem is then equivalent to $\hat{\beta} = \mathrm{arg}\min_\beta \|\tilde{y} - \tilde{X}\beta\|_2^2$. We can solve analytically by setting $\frac{\partial L^{se}}{\delta \beta} (\hat{\beta}) = -2\tilde{X}^\top (\tilde{y} - \tilde{X}\beta)$. Setting equal to zero, we get the least squares solution $\hat{\beta} = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top y$. We can also compute the second derivative, $\frac{\partial^2 L^{se}}{\partial \beta^2} (\hat{\beta}) = 2\tilde{X}^\top\tilde{X}$. This quantity is always positive semi-definite. If its inverse exists, we can compute a unique global minimum $\hat{\beta}$.

Aside: Linear algebra review

A matrix $X \in \mathbb{R}^{n \times m}$ defines a linear mapping $f_X : \mathbb{R}^m \rightarrow \mathbb{R}^n$. The image of a mapping $f$ is the set of all possible function values, $\mathrm{image}(f_X) = \{y \in \mathbb{R}^n | Xz = y \;\mathrm{for\;some}\; z \in \mathbb{R}^m\}$. The image of a linear mapping is a linear subspace in $\mathbb{R}^n$. The columns of $X$ form a basis of the image space: $\mathrm{image}(X) = \mathrm{span}(X_1^{col}, \ldots, X_m^{col})$. The number of linearly independent column vectors is the dimension of the image space, and is called the column rank. We can invert $f$ if for each value $y$ there is only one $z$ such that $f(z) = y$. Only square matrices can be inverted, and furthermore $f_X$ is only invertible if the image has the same dimension as the input space. The matrix $X^\top X$ is always a square matrix; it will be invertible if $X$ has full column rank. A matrix $A$ is symmetric if $A = A^\top$ and is orthogonal if $A^{-1} = A^\top$. Orthogonal matrices either rotate or permute the coordinate system. A basis of vectors $\{v_1, \ldots, v_m\}$ of $\mathbb{R}^m$ is an orthonormal basis of they are pairwise orthogonal and have length one, that is $$ \langle v_i, v_j \rangle = \begin{cases} 1, & i = j\\ 0, & i \ne j\\ \end{cases} $$ A matrix is orthogonal if its rows form an orthonormal basis. Given a basis $\mathcal{E} = \{e_1, \ldots e_d\}$ of a vector space, we can write any vector $x$ in the space as $x = \sum\limits_{j = 1}^d [x_j]_{\mathcal{E}} e_j$ where $[x_j]_{\mathcal{E}} = \langle x_j, e_j \rangle$. Given some other basis $\mathcal{B}$, if we form a matrix $M = [[e_1]_{\mathcal{B}}; \ldots [e_d]_{\mathcal{B}}]$ then we can transform between the basis by $M[x]_\mathcal{E} = [x]_\mathcal{B}$. If both $\mathcal{E}$ and $\mathcal{B}$ are orthonormal bases, $M$ is orthogonal. We can then represent any matrix $A$ in terms of the basis $\mathcal{E}$ as $[A]_\mathcal{E} = \left([A(e_1)]_\mathcal{E}, \ldots, [A(e_d)]_\mathcal{E}\right)$. We can convert the matrix $A$ to another basis by $[A]_\mathcal{B} = M[A]_\mathcal{E}M^{-1}$. If we have two orthonormal bases $\mathcal{V}$ and $\mathcal{W}$, there exists an orthogonal matrix $O$ such that $[A]_\mathcal{V} = O[A]_\mathcal{W}O^{-1}$ for any linear mapping $A$.

Given a square matrix $A \in \mathbb{R}^{m \times m}$, we call a vector $\xi \in \mathbb{R}^m$ an eigenvector of $A$ if the direction of $\xi$ doesn't change when you apply the linear transformation $A$. More formally, there exists some corresponding eigenvalue $\lambda \in \mathbb{C}$ such that $A\xi = \lambda \xi$. In other words, multiplying the eigenvector $\xi$ by $A$ simply scales it. When the matrix $A$ is symmetric, all of its eigenvalues are real numbers. Furthermore, the symmetric matrix will have $\mathrm{rank}(A)$ distinct eigenvectors, which are pair-wise orthogonal. Since they are already orthogonal, we can create an orthonormal basis of $\mathbb{R}^{\mathrm{rank}(A)}$ consisting of eigenvectors of $A$. If you transform $A$ into the basis of its eigenvectors, you will get a diagonal matrix of its eigenvalues. We call a matrix positive/negative (semi)definite when all eigenvalues are greater/less than (or equal to) 0. If we have a symmetric full-rank matrix $A \in \mathbb{R}^{m \times m}$ then we can represent any vector $v \in \mathbb{R}^m$ using the basis $\xi_1, \ldots \xi_m$ as $v = \sum\limits_{j = 1}^m v_j^A \xi_j$ where $v_j^A \in \mathbb{R}$. Multiplying by $A$ yields \begin{align*} Av &= A\sum\limits_{j = 1}^m v_j^A \xi_j\\ &= \sum\limits_{j = 1}^m v_j^A A \xi_j\\ &= \sum\limits_{j = 1}^m v_j^A \lambda_j \xi_j \end{align*} which is itself a linear combination of the eigenvectors $\xi_j$. When we repeatedly apply some matrix $A$ to a vector $v$, the scale doubles with each application $A$ when $v$ is an eigenvector of $A$. When $v$ is not an eigenvector, the direction of $A^n v$ converges to the direction of the eigenvector with the largest eigenvalue as $n$ grows large.

Symmetric matrices frequently show up in quadratic forms, which is a function $q_A(x) : \mathbb{R}^m \rightarrow \mathbb{R}$ which computes $\langle x, Ax \rangle$. This is the $m$-dimensional analog of a one-dimensional parabola. If you slice through the quadratic form of a positive definite matrix in any direction you receive a parabola and its contour lines will be elliptic. When the eigenvalues are very close, the contour lines will be roughly circles; otherwise smaller eigenvalues will cause strong eccentricity in the direction of the corresponding eigenvector.

Aside: Covariance matrices

We can compute the covariance of two random variables $X_1, X_2$ as $\mathrm{Cov}(X_1, X_2) = \mathbb{E}[(X_1 - \mathbb{E}[X_1])(X_2 - \mathbb{E}[X_2])]$. The covariance of a single random variable against itself is called its variance. If we have a random vector $X = [X_1, \ldots, X_m]$, we can form its covariance matrix as $\Sigma_{i, j} = \mathrm{Cov}(X_i, X_j)$. This matrix will be symmetric because $\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}(X_2, X_1)$. The Gaussian distribution is defined according to its mean $\mu$ and covariance matrix $\Sigma$ as $p(x; \mu, \Sigma) = \frac{1}{\sqrt{2\pi \mathrm{det}(\Sigma)}} \mathrm{exp}\left(-\frac{1}{2} \left \langle (x - \mu) \Sigma^{-1} (x - \mu) \right\rangle \right )$. It will have a maximum at $x = \mu$. Because $\Sigma$ is symmetric, we know that it has an eigenvector orthonormal basis $\xi_1, \ldots, \xi_m$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_m$. If we rotate the coordinate system of $\Sigma$ to $\xi_1, \ldots, \xi_m$ then the resulting matrix will be diagonal; we write $\Sigma_{[\xi_1, \ldots, \xi_m]} = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$. The density function will be elongated most along those eigenvalue axes with largest eigenvalues. We can think of the eigenvalues as random vectors, represented as a linear combination of the axis orthonormal basis $e_1, \ldots, e_m$ by $\xi_i = \sum\limits_{j = 1}^m \alpha_{ij}e_j$. We can transform our original matrix to the eigenvector basis using the orthogonal transformation matrix $O_{ij} = \alpha_{ij}$. Given a sample $X$ from our Gaussian distribution, we can represent it in the eigenvector orthonormal basis by $(X_{[\xi_1, \ldots, \xi_m]})_i = \sum_{j = 1}^m \alpha_{ij}X_j$. This is just the sum of random variables, which is itself a random variable. Given some gaussian distribution, we can shift the axis of its coordinate system by $\mu$ and rotate its coordinate axes using the eigenvector orthonormal basis of $\Sigma$. The resulting transformed Gaussian will have a diagonal covariance matrix $\Sigma_{[\xi_1, \ldots, \xi_m]} = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$. A point $X_{[\xi_1, \ldots, \xi_m]}$ represented in the new coordinate system will consist of $m$ independent Gaussian variables with mean $0$ and variance $\lambda_i$. In other words, under a Gaussian assumption, the dependence between random values in some random vector is an artifact of being in the wrong coordinate space.


Least squares will only work when $\tilde{X}$ has full column rank, or equivalently that $\tilde{X}^\top \tilde{X}$ is invertible. If $\tilde{X}^\top \tilde{X}$ is almost singular (very close to an non-invertible matrix), least squares will be numerically unstable, resulting in a high variance of predictions when changing the inputs slightly. Any singular matrix will map every point in some linear subspace to a single point. If a matrix is almost singular, it won't map all of these points to the same point, but it will map them all very close (above numerical resolution) together. For example, consider a matrix $A$ whose smallest (in magnitude) eigenvalue $\lambda_{min}$ is not $0$ (so the matrix is not singular) but it is very small in magnitude. If $x_1, x_2$ are two points in the subspace spanned by $\xi_{min}$ which are very far apart, then their distance under the image of $A$ given by $\| Ax_1 - Ax_2 \|$ will be tiny. The inverse can then map points which are very close together to very different predictions. So, the predictions become unreliable and volatile. For linear regression, if we take some training data point $\tilde{x}_i$ and make a small change in its output value $\tilde{y}_i$ then intuitively we do not expect a large change on prediction. However, if $\tilde{X}^\top \tilde{X}$ is almost singular, then $\hat{\beta}$ and hence the prediction $\langle \hat{\beta}, (1, x) \rangle$ will change dramatically. We can measure the extent to which a matrix is close to singular by studying its largest and smallest eigenvalues $\lambda_{max}$ and $\lambda_{min}$. If the smallest and largest eigenvales scale their eigenvectors by very different amounts, the matrix must be close to singular. We therefore define the spectral condition $c(\cdot)$ of a matrix as $c(A) = |\lambda_{min}|/|\lambda_{max}|$.

We can try to make least squares more robust by trying to make $\tilde{X}^\top \tilde{X}$ less close to being singular. We can instead solve a ridge regression problem where $\hat{\beta} = (\tilde{X}^\top \tilde{X} + \lambda \mathbb{I})^{-1} \tilde{X}^\top y$ where $\mathbb{I}$ is the identity matrix and $\lambda$ is some scalar. The addition of the term $\lambda\mathbb{I}$ adds $\lambda$ to all of the eigenvalues of $\tilde{X}^\top \tilde{X}$ because $(\tilde{X}^\top \tilde{X} + \lambda\mathbb{I})\xi_i = (\lambda_i + \lambda)\xi_i$. We can therefore consider $\tilde{X}^\top \tilde{X} + \lambda\mathbb{I}$ to be a regularized version of $\tilde{X}^\top \tilde{X}$ which tries to ensure that $\tilde{X}^\top \tilde{X}$ is not close to singular. We can also formulate this problem as a minimization problem (which we don't actually need to do because we have a closed-form solution) as $\hat{\beta} = \mathrm{arg}\min_\beta \|y - \tilde{X}\beta \|_2^2 + \lambda \|\beta\|_2^2$. If we solve this equation for the minimizer, we get back the closed-form ridge regression. We can also write $L^\prime(\beta) = \sum\limits_i L^{se}(y_i, f(\tilde{x}_i; \beta)) + \lambda \|\beta\|_2^2$. We call the loss term a goodness-of-fit, and the $\|\beta\|_2^2$ a penalty. The penalty tries to make those solutions we like “cheaper” and make those we want to discourage more expensive. The parameter $\lambda$ controls how much we want to trade off goodness of fit for solutions we like. If we change $\beta$ by some amount, the quadratic penalty will change according to $\beta$'s value. In other words, the amount that some change in $\beta$ changes the overall cost depends on the value of $\beta$ itself. You can therefore get a larger reduction in the overall cost by reducing the largest coefficient by some amount. So, if $\lambda$ makes the penalty very influential, then the solution will favor $\beta$ whose entries are all the same size. This is where the name “shrinkage” comes from - if the sum of the elements in $\beta$ is fixed, then the solution favors solutions which are as close as possible to the origin; that is, with all entries the same size.


In many problems of interest, the data is very high-dimensional with only a few important dimensions; least-squares treats all dimensions equally and averages relevant dimensions with irrelevant ones, resulting in a loss of the important “signal”. We are interested in selecting some small subset $d$ of the $D$ dimensions and ignore the rest (like feature selection). This may improve our prediction, and also may tell us which dimensions are important. As a result, we are interested in $\beta$s which have a small number of non-zero entries but still perform good prediction. When a solution $\beta$ has few nonzero entries, we call it sparse. A good penalty would be one which keeps $\beta_k$ which are large, and pushes the rest towards zero. A quadratic penalty won't work - we don't want all the entries to be the same size. Instead, we want a linear penalty, giving $\hat{\beta} = \mathrm{arg}\min_\beta \|y - \tilde{X}\beta\|_2^2 + \lambda\|\beta\|_1$ where $\|\beta\|_1 = \sum_i |\beta_i|$. For this penalty, the reduction in cost for changing $\beta$ by some amount does not depend on the value of $\beta$. Because of the shape of the contour lines of the Euclidian distance, the L1 norm will always favor solutions on the axis (sparse). This is not true of the L2 norm. $\lambda$ will thus effect how important it is for the solution to be on the axis lines.

In the more general form, we can use any $\ell_p$ norm for the penalty, defined by $\|\beta\|_p = \left( \sum\limits_{j = 1}^D |\beta_j|^p \right)^\frac{1}{p}$, which is the Euclidian norm for $p = 2$ and the L1 norm (sum of absolute values) for $p = 1$. For $p > 2$, the contour lines (unit balls) begin to look like squares and focus more and more on large entries. In the limit the norm measures the largest absolute entry. For $p < 1$, sparsity is encouraged more an more but all of these norms are nonconvex - the algorithm becomes combinatoric. When $p$ approaches zero, it simply records the number of nonzero entries. This is nonconvex - we would have to try every combination of ones and zeros. In practice, we still use the L1 norm because it is the best convex approximation of the L0 norm.

In general, for any $p \ne 2$, we must solve $\hat{\beta} = \mathrm{arg}\min_\beta \|\tilde{y} - \tilde{X}\beta \|_2^2 + \lambda \|\beta \|_2^2$ which looks like a lagrange version of $\min_\beta \|\tilde{y} - \tilde{X}\beta \|_2^2 \;\mathrm{s.t.}\; \|\beta\|_p^p \le 0$ but this constraint doesn't make any sense because it is only satisfied when $\beta = 0$. However, a constant shift does not affect the location of the minima, so $\mathrm{arg}\min_\beta \|\beta\|_p^p = \mathrm{arg}\min_\beta \|\beta\|_p^p - t$. We can thus formulate the constraint as $\|\beta\|_p^p \le t$. This problem can then be solved with a numerical optimization technique; it is convex for $p \ge 1$.

Model Bias and Variance

We have seen that we can trade off the flexibility of the model for the stability of the estimates - more parameters mean that we can adapt better to data, but this makes our estimates less stable. In classification, this is overfitting. We can study this effect by interpreting models as probability distributions. For example, we can consider a random variable that can take on three possible values $\{a, b, c\}$. There are three distributions which are “special” - the ones which assign a probability of $1$ to each possible outcome, e.g. $\delta_{\{a\}} = \mathrm{Pr}\{X = a\} = 1$. Each distribution can thus be represented as $P = c_a \delta_{\{a\}} + c_b \delta_{\{b\}} + c_c \delta_{\{c\}}$ where $c_a + c_b + c_c = 1$ and $c_i \ge 0$. That is, we can interpret any probability distribution as some convex combination of distributions which place all probability on a single outcome. These distributions $\delta_x$ are called Dirac distributions or point mass. This function has the important property that $\int_X f(x) \delta_{x_0}(dx) = f(x_0)$. These distributions cannot be represented as convex combinations of other distributions. For higher dimensional spaces, each $\delta_x$ is still an extreme point of the space of possible distributions (a high dimensional triangle). If $\{x_1, \ldots, x_n\}$ is a sample, then $\mathbb{F}_n = \frac{1}{n}\sum_{i = 1}^n \delta_{x_i}$. Now, we can look at a model as a subset of this space of possible probability distributions. The actual distribution of our data $p_0$ may not be the same point in the space of possible probability distributions as $\mathbb{F}_n$. The law of large numbers says that as $n \rightarrow \infty$, $\mathbb{F}_n$ approaches $p_0$.

A statistic is some function of a distribution, written $S[p]$. It's a function where you plug in a distribution, and it gives you some value. Many statistics (like mean and variance) can be written as $\int_x f(x) p(x) d x$ where $f(x)$ is some function. When $f(x) = x$, we get the expected value. If we plug in the empirical distribution $\mathbb{F}_n$ for $p$, we get $\hat{S} = \frac{1}{n} \sum_{i = 1}^n f(x_i)$ which is the “plug-in estimator” of some statistic. When $S$ is the mean or variance, the plug-in estimator looks like the Maximum Likelihood Estimator, except that we have made no assumption about the distribution to obtain this result.

Recall that a statistical model $\mathcal{P}$ is a set of distributions indexed by parameters. We can therefore call a model a subset of $M(X)$, where $M(X)$ is the set of all probability distributions on the sample space $X$. Suppose that we observe some data generated by some distribution $p_0$. If the true distribution is in our model $\mathcal{P}$ (space of possible distributions), then we say that the model is correctly specified. When $p_0$ is not in our model, we say it is misspecified. Equivalently, if the empirical distribution converges to something in $\mathcal{P}$, the model is correctly specified. When the model is correctly specified, we should be able to find some parameter $\theta \in \mathcal{T}$ which fully explains the data, using some estimation procedure. If correctly specifying the model is extremely important, then we should make the model $\mathcal{P}$ as large as possible. This corresponds to a model which is more flexible; it contains more possible distributions $p$. When we intentionally misspecify, we are essentially saying that no matter how much data we observe, we can never fully explain the data - this is a form of bias. However, if we make the model more flexible, then they can vary dramatically between sample sets, unless we have tons of data. There is therefore a trade-off between model bias and variance.

It would be useful to have some notion of the complexity of the model - how large the model space is. A simple measure of model complexity is the number of scalar parameters of the model - the degrees of freedom. A Gaussian in $\mathbb{R}^d$ with fixed variance has $d$ degrees of freedom - one for each entry of its mean vector. When the variance is not fixed, the model has $d + d(d - 1)/2$ degrees of freedom because the symmetric Covariance matrix has $d(d - 1)/2$ degrees of freedom. Note that this doesn't fully quantify the freedom of this model - we also need the covariance matrix to be positive semidefinite, which is not capture by simply saying it adds $d(d - 1)/2$ degrees of freedom. Some models are nonparametric (with an infinite-dimensional parameter space). When the number of parameters grow as the number of data points grow, the model is nonparametric because you have to assume the model has an infinite-dimensional parameter space to accommodate any possible amount of data.

Say we perform regression over some dataset $X$ using some function $f \in \mathcal{F}$ where $\mathcal{F}$ is the space of all possible regressor functions (e.g. in linear regression $\mathcal{F}$ is the space of linear functions) where we are also modeling some additive noise $\epsilon$ with zero mean and variance $\sigma_{\epsilon}^2$. Our model is then $Y = f(X) + \epsilon$. If we find some $\hat{f}$ in our model space which approximates $f$, our prediction error for some point $x_0$ is then the expected error, which we can compute as $$ \mathbb{E}(Y - \hat{f}(x_0))[X = x_0] = \sigma_\epsilon^2 + (\mathbb{E}[\hat{f}(x_0)] - f(x_0))^2 + \mathbb{E}[\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]]^2 $$ The term $\sigma_\epsilon^2$ we call the “irreducible error” - no matter what we do, we cannot remove this error. The term $(\mathbb{E}[\hat{f}(x_0)] - f(x_0))^2$ is the bias term - it decreases with model complexity, because the more flexible our model is the easier it is to find a $\hat{f}$ which approximates $f$. The term $\mathbb{E}[\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]]^2$ is the model variance, which increases as our model becomes more flexible. In linear least-squares regression, our Variance term is $\|(\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top x_o \|\sigma_\epsilon^2$. For regularized ridge regression, the varaince term is $\|(\tilde{X}^\top \tilde{X} + \lambda \mathbb{I})^{-1} \tilde{X}^\top x_o \|\sigma_\epsilon^2$. This is generally smaller than the unregularized case, so the bias is increased. Intuitively, we are distorting our estimate in return for potentially getting an estimator with lower variance. In general, it can be helpful (in terms of prediction error) to permit some bias (even if means you can't recover the true distribution) because it decreases the variance (so your model may converge to something “good” more quickly). When cross-validating, we can get some intuition as to the variance of our model by examining how the estimator changes across folds.

Unsupervised Learning

When no label information is available for the training data, we cannot use supervised learning. When we simply try to find some structure in the data, we are doing unsupervised learning. This requires some explicit model selection - our choice of a model denotes what kind of structure/pattern we are looking for.

Dimension Reduction using PCA

If we have some training data $\{x_1, \ldots x_n\}$ in some high dimensional space $\mathbb{R}^D$, we might be interested in finding a lower-dimensional projection of the space which preserves the important structure in the data. We are essentially looking for a low dimensional subspace $V \subset \mathbb{R}^D$ with $\mathrm{dim}(V) = d < D$, onto which we will project our training data to obtain some dimension-reduced representation. For $d = 2, 3$, this is useful for plotting.

Principal component analysis makes the assumption that the dimensions which capture most of the uncertainty of the data are those which are most important or interesting. We can measure the uncertainty of each dimension as its variance. Principal Component Analysis therefore computes the empirical covariance matrix of the data and computes the eigenvalues and eigenvectors $\xi_1, \ldots \xi_D$ of the covariance matrix. The $d$ largest eigenvalues can be thought of as denoting the $d$ more important dimensions. We can therefore define the subspace $V = \mathrm{span}\{\xi_1, \ldots, \xi_d\}$ spanned by the $d$ eigenvectors with the largest eigenvalues to project onto $V$ by computing $x_i^v = \sum\limits_{j = 1}^d \langle x_i, \xi_j \rangle \xi_j$.

The empirical covariance of the data can be computed as $\hat{\Sigma}_n = \frac{1}{n} \sum\limits_{i = 1}^n (x_i - \hat{\mu}_n)(x_i - \hat{\mu}_n)^\top$ where $\hat{\mu}_n$ is the empirical estimation of the mean. PCA attempts to project onto those dimensions for which the variance is highest. The variance of the projected data is given by $\langle v, \hat{\Sigma}_n v\rangle$ where $v$ is the vector you project onto to get the PCA representation for one of the dimensions. We can therefore formulate PCA as an optimization problem as $\max_v \langle v, \hat{\Sigma}_n v \rangle \;\mathrm{s.t.}\; \|v\|_1 = 1$. The constraint esnures that we are only maximizing this projection by changing the direction of $v$ (otherwise we could just scale up the length of $v$ to increase the objective). The optimal direction will always be the one which lines up with the eigenvector corresponding to the largest eigenvalue. For the next dimension, you project using only the data which is orthogonal to $v$, and condinue onwards. This chooses the direction of the eigenvector corresponding to the largest, second largest, etc. Eigenvalues. So, we can justify PCA as the projection which maximizes the variance of the projected data.

Example: MNIST

In the MNIST digit dataset, each data point is in $\mathbb{R}^{256}$. The mean and eigenvectors for all of the $3$ digits will also be vectors in $\mathbb{R}^{256}$. The first few eigenvectors can be regarded as a summary of the main features of the data. If we represent each image as its projection onto the first $d$ eigenvectors, then we need to store the $d$ eigenvectors (to reconstruct the image - this is like our dictionary) as well as the $d$ coefficients (the PCA projection onto each eigenvector) for each image. We can then reconstruct by $x^{(d)} = \sum\limits_{i = 1}^d c_i \xi_i$ where $c_i$ is the projection of $x$ onto the $i$th eigenvector $\xi_i$. So, if we have $n$ images then we only need to store $nd + d$ numbers. If this is less than $256n$, then we are using less data than if we were using raw images. However, we are losing some part of the data - the part explained by the eigenvectors we are not storing. So, we call this lossy compression. The value of $d$ is a model selection problem. For visualization, we use 2 or 3. For approximation or compression, we'd like to minimize some approximation error while still keeping $d$ small. One way to quantify the importance of each eigenvector is to look at the magnitude of the corresponding eigenvalues. If there is a large jump or gap in the eigenvalue spectrum (the eigengap) then this suggests that only a few of the eigenvectors are useful.


In data clustering, we want to separate data into disjoint groups but we don't have labeled training data. We only have data $x_1, \ldots, x_n$ without labels. We assume that each data point belongs to exactly one group or class, which we call clusters. We'd like to be able to determine what the best (in some sense) clustering is, given only the data. For $K$ clusters, we write the assignment of each data point $x_i$ as $m_i$ where $m$ is a length-$n$ vector denoting the cluster of each data point.

Example Application: Image Segmentation

In computer vision, a problem of interest is the partitioning of an image into coherent regions. The desired solution depends greatly on the definition of what a “coherent” region means. There is no semantics or labeling - only segmenting into regions which look “similar”. For simple images, it is intuitive what the segments “should be”, based on color or darkness. In practice, we extract features for each pixel in the image by extracting features from the surrounding small window of pixels. Each window is some feature vector $x_i$. If we perform clustering on the data points $x_i$, each cluster represents a segment of the image.


K-means will segment the dataset into $K$ different clusters in an unsupervised manner. First, $K$ random “cluster centers” $\mu_i^{(0)} \in \mathbb{R}^d$ are chosen in the feature space $\mathbb{R}^d$. Then, at each iteration $j$, the cluster assignment value $m_i$ for the point $x_i$ is set by computing $m^{(j + 1)} = \mathrm{arg}\min_{k \in \{1, \ldots, K\}} \|x_j - \mu_k^{(j)}\|$. The cluster centers are then updated by computing the mean of all points assigned to each cluster; that is, $$ \mu_k^{(j + 1)} = \frac{1}{ |\{i|m_i^{(j + 1)} = k\}|} \sum_{i|m_i^{(j + 1)} = k} x_i $$ We iterate until the solution becomes stable, that is, until the change in the mean location is small, measured by $\sum\limits_{k = 1}^K \|\mu_k^{(j + 1)} - \mu_k^{(j)}\| < \tau$. If we compute the distance at any point in the space to each cluster center, we can determine which cluster a point at this location would be assigned to. In this way, the space is divided into $K$ regions, corresponding to which cluster center is closest. Partitioning the space in this way is called a Voronoi decomposition.

We can re-interpret the cluster centers $\mu_k$ as means of spherical Gaussian distributions (with unit covariance). The algorithm can be reformulated by starting with $K$ randomly chosen vectors and iteratively assigning each $x_i$ to the Guassian under which it has the highest probability of occurrence then re-fitting the Gaussian for each cluster to the points assigned to it. If we re re-fit the Gaussians using maximum likelihood, this algorithm is equivalent to the one described above. This is similar to the Gaussian Discriminant classifier, except that we are creating “fake” training data at each iteration based on the state of our clustering. This is common to many clustering algorithms - at each step, make fake training data based on the current state, then re-fit the model using this training data. We can get a better statistical explanation than our reformulation of the K-means algorithm here, and in addition gain the ability to generalize to other distributions (e.g. Gaussians with general covariance structure and multinomial distributions).

Mixture Models

If we are given a parametric model $p(x | \theta)$ and a probability density $q(\theta)$, we can compute the mixture model as $\pi(x) = \int_\mathcal{T} p(x|\theta) q(\theta) d \theta$. $q$ is called the mixing distribution. Integrals of this form are suggestive of an iterative procedure, where we sample $\theta^{(i)}$ from $q$, then sample $X_i$ from $p(\cdot | \theta^{(i)})$. In other words, it always corresponds to a two-state sampling procedure. A typical example is the student's $t$-distribution, where $q$ is a Gamma distribution with parameters $\alpha, \beta$ and $\theta$ is treated as the inverse variance $1/\sigma^2$ for $p$ which is regarded as a 0-mean Gaussian distribution. We are then given the mixture model $$ \pi\left( x | 0, v = \frac{\alpha}{\beta}, \tau = 2\alpha \right) = \int_{\mathbb{R}^+} \mathrm{Normal}\left(x \biggr| 0, \frac{1}{\theta}\right)\mathrm{Gamma}(\theta | \alpha, \beta) d \theta $$ This mixture is similar to a Gaussian, except that mixing with the Gamma distribution basically causes a “long tail”.

In practice, we will only consider finite mixture models, defined as $\pi(x) = \sum\limits_{i = 1}^K c_k p(x | \theta_k)$ where $\sum\limits_k c_k = 1$ and $c_k \ge 0$. We are essentially taking a convex combination of differently parameterized versions of the same distribution (for example, Gaussians). Finite mixtures are a special case of mixture models where we set $q = \sum\limits_{k = 1}^K c_k \delta_{\theta_k}$ - when integrating, we will “pick out” the distribution parameterized by $\theta_k$. These are often referred to as “discrete mixture models”, because the mixing measure is discrete. Each $\delta_{\theta_k}$ corresponds to some specific discrete parameter setting. The values $c_k$ then control the “weight” of the distribution parameterized by $\delta_{\theta_k}$. To sample from this discrete $q$, we choose some random $k$ with probability $c_k$. We then take the corresponding $p(x | \theta_k)$ and sample from it. In a clustering models, $p(x | \theta_k)$ is the distribution of the $k$th cluster. The $c_k$s are then the relative cluster sizes.

The maximum likelihood estimator for a finite mixture model is given by $$ (\hat{c}, \hat{\theta}) = \mathrm{arg}\max_{c, \theta} \prod_{i = 1}^n \left( \sum_{k = 1}^K c_k p(x_i | \theta_k) \right) $$ We can compute the log likelihood, differentaite and set equal to zero to find the maximum likelihood solution $$ \sum_{i = 1}^n \frac{c_k \frac{\partial}{\partial \theta_k} p(x_i | \theta_k)}{\sum\limits_{k = 1}^K c_k p(x_i | \theta_k)} = 0 $$ The appearance of the $i$ term in the denominator makes solving this equation difficult, because we can't just ignore the denominator. In addition, solving this problem will not assign the points to each cluster.

The EM algorithm

In k-means, we can simultaneously estimated the assignments $m_k$. These variables are called latent or hidden variables, because they are not observed. If we knew the value of the latent variables $m_i$, we could estimate $p(x | \theta_k)$ separately, over each cluster $k$. We can then estimate the cluster proportions $c_k$ as the proportion of the points in each cluster. The EM algorithm encapsulates this proces by estimating the assignments $m_i$ given the current estimates of $c_k$ and $\theta_k$ (the expectation step), then estimates the parameters $c_i$ and $\theta_i$ given the current estimates of the assignments (maximization step).

Instead of representing each $m_i$ as some integer in $\{1, \ldots, K\}$, we can instead write them as vectors $M_i$ with $k$ entries, where a $1$ in the $k$th entry indicates that the point $x_i$ is in cluster $k$ (this is the same as in Bagging). We can collect these vectors into a matrix $M$, where each row corresponds to data point and each column corresponds to a cluster (i.e. $M_{i, k} = 1$ if $x_i$ is in cluster $k$). When the values in $M$ are $0$ or $1$, we are making hard assignments - each point is in one and exactly one cluster. If, instead, we allow the values to be continuous on the interval $[0, 1]$, we are instead saying that the probability that $x_i$ is in $k$ is $a_{ik}$ where $a_{ik}$ is the “soft” version of $M_{ik}$. In this interpretation, we also require that each vector $a_i$ sums to one, so that they can truly be interpreted as probabilities. We can then make these probabilistic assignments “hard” by choosing the maximum value in each $a_i$. The vectors $M_i$ are latent variables, which will be estimated using the EM algorithm. The EM algorithm will then, at each step, compute the expected value of the hard assignment by $$ a_{i,k} = \frac{c_k p(x_i | \theta_k)}{\sum\limits_{j = 1}^K c_j p(x_i|\theta_j)} $$ where $a_{i, k}$ cna be thought of as the probability that $x_i$ was generated by component $k$ given the mixing weights $c$ and parameter $\theta$.

In the M-step, we re-compute the parameters of the model. We set the cluster proportion estimate $$ \hat{c}_k = \frac{\sum\limits_{i = 1}^n M_{ik}}{n} = \frac{\sum\limits_{i = 1}^n a_{ik}}{n} $$ where we are making an approximation of the hard assignment $M_i$ in switching to the estimated soft assignments $a_{ik}$. To estimate the distribution parameters for a Gaussian, we compute $$ \hat{\mu}_k = \frac{\sum\limits_{i = 1}^n M_{ik} x_i}{M_{ik}} = \frac{\sum\limits_{i = 1}^n a_{ik} x_i}{a_{ik}} $$ where again we approximate $M_i$ with the estimated soft assignments $a_{ik}$. Similarly for covariance we can estimate $$ \hat{\Sigma}_k = \frac{\sum\limits_{i = 1}^n a_{ik}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top}{\sum\limits_{i = 1}^n a_{ik}} $$ We iteratively perform an E-step to compute the assignment weight matrix and an M-step to re-estimate the parameters until convergence. Fitting general Gaussians using E-M may be slower than k-means because you are also determining the covariance.

In the completely general case, we are estimating $\theta$ and $c$ by approximate Maximum Likelihood to obtain $\pi(x) = \sum\limits_{k = 1}^K c_k p(x | \theta_k) = \pi(x|c, \theta)$. Whenever we have a density of a single random variable $p(x)$, we can always regard it as the distribution of two random variables with one variable integrated out, $p(x) = \int p(x, y) dy$. We can similarly write for a finite distribution $\pi(x|c, \theta) = \sum\limits_M \pi(x, M | c, \theta)$. We can compute the joint log likelihood as $$ \sum_{i = 1}^n \log(\pi(x_i | c, \theta )) = \sum_{i = 1}^n \log\left(\sum_{M_i} \pi(x, M_i|c, \theta) \right) $$ This function is hard to maximize because of the sum in the logarithm; we can instead maximize the lower bound $$ \sum_{i = 1}^n \sum_{M_i} \log(\pi(x, M_i|c, \theta)) \le \sum_{i = 1}^n \log\left(\sum_{M_i} \pi(x, M_i|c, \theta) \right) $$ This is “safe”, but it's not precise; we may not get the actual optimal point. We want to maximize this quantity, but we don't know $M_i$. So, at each iteration we use the estimates of $c$ and $\theta$ from the previous step. This gives $\mathrm{Pr}\{M_i = k | c^{(j)}, \theta^{(j)}\} = a_{ik}^{(j)}$, which produces the M-step $$ (c^{(j + 1)}, \theta^{(j + 1)}) = \mathrm{arg}\max_{c, \theta} \left\{ \sum_{ik} a_{ik}^{(j + 1)} \log( c_k p(x_i | \theta_k)) \right\} $$

We can show that our estimate at each step increases likelihood of the model unless our parameters already describe a (locally) optimal solution. However, the solution may not be global - it is effectively a gradient method (because it always improves the estimate) which searches for stationary points. In practice, if you run the EM algorithm twice and get two sets of parameters, you can plug them into your log-likelihood and choose the parameter setting which has a higher likelihood. So, we can compare the local maxima we find to choose something which is the best local maximum we've found so far (which hopefully may be the global maximum). In practice, you can perform E-M many times with random initialization and then choose the best one.

Exponential Family Models

The exponential family is the set of densities which have the form $$ P(x|\theta) = \frac{h(x)}{Z(\theta)} e^{\langle S(x), \theta \rangle} $$ The function $S$ from the sample space to some Euclidian space is called the “sufficient statistic”, $h$ is a function from the sample space to $\mathbb{R}_+$ and the function $Z$ which maps the parameter space to $\mathbb{R}_+$ is effectively a normalization factor which is called the partition function. This form of distributions gives rise to many useful properties, and many of the models we use in practice are exponential family models. Using exponential family models, we can define methods in general and then use them for specific exponential family members. We can move all terms into the exponential to get $$ p(x|\theta) = \mathrm{exp}(\langle S(x), \theta \rangle - \phi(x) - \psi(\theta)) $$ where $\phi(x) = -\log(h(x))$ and $\psi(\theta) = \log(Z(\theta))$. Note that the data and parameters only interact through the linear term $\langle S(x), \theta \rangle$; there is no joint nonlinear function of $x$ and $\theta$. The partition function $Z$ is never a design choice, because we must have $$ \int_X \frac{h(x)}{Z(\theta)} e^{\langle S(x), \theta \rangle} dx = 1 \rightarrow Z(\theta) = \int_X h(x) e^{\langle S(x), \theta \rangle} dx $$ so really, each exponential family member is specified completely by $S$ and $h$.

Example: 1-d Gaussian

We can write \begin{align*} \frac{1}{\sqrt{2\pi}\sigma} \mathrm{exp}\left( -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right) &= \frac{1}{\sqrt{2\pi}\sigma} \mathrm{exp}\left(-\frac{1}{2} \frac{x^2}{\sigma^2} + \frac{2x\mu}{2\sigma^2} \right) \mathrm{exp}\left(-\frac{1}{2} \frac{\mu^2}{\sigma^2} \right)\\ &= c(\mu, \sigma) \mathrm{exp}\left(x^2\frac{-1}{2\sigma^2} + x\frac{\mu}{\sigma^2} \right) \end{align*} such that $S(x) = (x^2, x)$, $\theta = \left(\frac{-1}{2\sigma^2}, \frac{\mu}{\sigma^2}\right)$, $h(x) = 1$ and $Z(\theta) = c(\mu, \sigma)^{-1}$ where $c(\mu, \sigma)$ is some function of $\mu$ and $\sigma$. So, $S(x) = (x^2, x)$, $\theta = \left(\frac{-1}{2\sigma^2}, \frac{\mu}{\sigma^2}\right)$, $h(x) = 1$ and $Z(\theta) = c(\mu, \sigma)^{-1}$. In higher dimensional Gaussians, we simply change to $S(x) = (xx^\top, x)$ and $\theta = \left(-\frac{1}{2}\Sigma^{-1}, \Sigma^{-1} \mu \right)$.

In most problems of interest, there is some natural statistic which matters most. When we choose the exponential family model with this statistic, usually it is the most simple “prototypical” distribution on that sample space, for that statistic.

If we apply the log likelihood to the exponential family density, then we get $$ \log\prod_{i = 1}^n p(x_i | \theta) = \sum_{i = 1}^n \log(x(x_i)) - \log(Z(\theta)) + \langle S(x_i), \theta \rangle $$ Differentiating with respect to $\theta$ we get $$ \frac{\partial}{\partial\theta} \sum_{i = 1}^n \log(x(x_i)) - \log(Z(\theta) + \langle S(x_i, \theta \rangle ) = -n \frac{\partial}{\partial\theta} \log(Z(\theta)) + \sum_{i = 1}^n S(x_i) $$ So, we can find the MLE estimator of $\theta$ by setting equal to $0$ which yields $$ \frac{\partial}{\partial\theta} \log(Z(\hat{\theta})) = \frac{1}{n} \sum_{i = 1}^n S(x_i) $$ In other words, the expectation of the sufficient statistic. From above, we know that the partition function can be computed by $Z(\theta) = \int_X h(x) e^{\langle S(x), \theta \rangle} dx$, so we we can simplify by \begin{align*} \frac{\partial}{\partial\theta} \log(Z(\theta)) &= \frac{\frac{\partial}{\partial \theta} Z(\theta)}{Z(\theta)}\\ &= \frac{\int h(x) \frac{\partial}{\partial \theta} \mathrm{exp}(\langle S(x), \theta\rangle) dx}{Z(\theta)}\\ &= \frac{\int S(x) h(x) \mathrm{exp}(\langle S(x), \theta\rangle) dx}{Z(\theta)}\\ &= \mathbb{E}_{p(x|\theta)} [S(x)] \end{align*} Using the maximum-likelihood criteria, we get $$ \mathbb{E}_{p(x|\theta)} [S(x)] = \frac{1}{n} \sum_{i = 1}^n S(x_i) $$ The term on the right-hand side is $\mathbb{E}_{\mathbb{F}_n}[S(x)]$ which is the expectation of the empirical distribution $\mathbb{F}_n$. In other words, the MLE of an exponential family is found when the expectation of $S(x)$ is the same over $p(x|\theta)$ and the empirical distribution. We can therefore choose $\theta$ such that the expectation of the sufficient statistic over $p(x | \theta)$ is equivalent to the expectation of the sufficient statistic over our empirical distribution.

Performing expectation-maximization on an Exponential Family Model is easy when $\mathbb{E}_{p(x|\theta)} [S(x)] = \theta$. In the E-step, we compute the soft assignment weight matrix as $$ a_{ik}^{(j + 1)} = \frac{c_k^{(j)} p(x_i | \theta_k^{(j)})}{\sum\limits_{l = 1}^K c_l^{(j)} p(x_i | \theta_l^{(j)})} $$ and in the M-step we compute the proportions $c_k$ and parameters $\theta_k$ as $$ c_k^{(j + 1)} = \frac{\sum\limits_{i = 1}^n a_{ik}^{(j + 1)}}{n}, \theta_k^{(j + 1)} = \frac{\sum\limits_{i = 1}^n a_{ik}^{(j + 1)} S(x_i)}{\sum\limits_{i = 1}^n a_{ik}^{(j + 1)}} $$

The Multinomial Distribution

A random variable $\xi$ is categorical if it only takes on a finite set of values (categories), that is $\xi \in \{1, \ldots, d\}$ for some $d \in \mathbb{N}$, the number of classes. If we know that the probability that the variable is assigned to each category is $P(\xi = j) = t_j$ such that $t_j \ge 0$ and $\sum\limits_{j = 1}^d t_j = 1$, then we can parameterize our model by the vector of category probabilities $t_1, t_2, \ldots t_d$. We can represent a single sample as a vector, whose $j$th value is $1$ when the sample belongs to class $j$. To estimate the $t_j$s, we can denote the counts of each observed category $H = (H_1, \ldots H_d)$ (a histogram) and compute $t_j = \frac{1}{n}t_j$.

If each draw of $\xi$ is independent, we'd like to know what the probability of observing some histogram is, which we can compute as $$ P(H|t) = \frac{n!}{H_1! \cdots H_d!} \prod_{j = 1}^d t_j^{H_j} = \frac{n!}{H_1! \cdots H_d!} \mathrm{exp}\left( \sum_{j = 1}^d H_j \log (t_j) \right) $$ This is an exponential family model, with $S(H) = H$, $h(H) = \frac{n!}{H_1! \cdots H_d!}$, $\theta_j = \log(t_j)$, and $Z(\theta) = 1$. Intuitively, this gives a probability of $t_j$ to observing $\xi = j$ in a single draw. For $n$ draws, the probability that $\xi$ will be $j$ $n$ times is $t_j^n$. The number of ways $n$ elements can be subdivided into $d$ classes with $H_j$ elements falling into class $j$ is $\frac{n!}{H_1! \cdots H_d!}$. So, when computing $P(H|t)$ we are multiplying the total number of combinations by the probability of a single combination. The MLE of the parameter vector $t = (t_1, \ldots, t_d)$ is just the average across all observed histograms $\hat{t} = \frac{1}{n}(H_1, \ldots, H_d)$. The set of feasible vectors $t$ where $t_j \ge 0$ and $\sum\limits_{j = 1}^d t_j = 1$ are called the $d$-simplex in $\mathbb{R}^d$.

Example: Image Clustering

Given an greyscale image, we can represent it as a matrix of histograms where each histogram is the number of pixels around a small window which are of a certain shade. We can then model the image as a mixture of multinomials, where membership in each multinomial denotes clustering assignment.

Example: Text Models

Consider the setting where we have a huge collection of text documents. This complete set is our corpus. We would like to learn a model which describe natural language based on text documents written by humans. In a given document, many of the words will appear more than once. The list of unique words are the terms in our dictionary. If we are given a dictionary of all of the terms which appear in a given document, we can represent the document as a histogram $H = (H_1, \ldots H_d)$ where $H_j$ is the number of occurrences of term $j$ in the document. $d$ is the number of unique words in the dictionary (across all documents - it can be huge), and is the same across documents; however, $\sum_j H_j$ may vary across documents (e.g. there are a different number of words in each document). We can use the multinomial distribution as a simple probabilistic model of document generation. We denote $P(H | t, n)$ as this distribution where $t$ is the probability of each word and $n$ is the number of words to be drawn. This assumes that the word order doesn't matter (called bag-of-words).

We can extend this model to consider the context, or the co-occurrence and order of words. In language, the probability that you will see a certain word depends a great deal on the words you saw before it (the context). If we have tons of data, we can look at the number of times each word appears given the $N$ previous words - this is called an $N$-gram model. The simplest $N$-gram model is the bigram model, where you look at the probability of a word given only the previous word, i.e. $\mathrm{Pr}(w_l | w_{l - 1})$. Now, we need a new multinomial distribution for each possible previous word, so that we have a family of $d$ multinomial distributions. To estimate this model, we need to find all terms in the library which are preceded by $k$ and record their number o occurrences, giving a set of vectors $H_k = (H_{k1}, \ldots, H_{kd})$ where $H_{kj}$ is the number of times the term $j$ follows $k$. We then want to estimate the multinomial vector $t_k$ for each of these multinomials. In a general $N$-gram model, we try to preduct $\mathrm{Pr}(w_l|w_{l - 1}, \ldots, w_{l - (N - 1)})$. This can quickly get intractable for large datasets - if there are $T$ words, then we must consider $T^N$ possible combinations of words. If we need $n$ words to get a reliable estimate, we need $nT^N$ words. So, in practice you usually only encounter bigrams and trigrams.

Now, say hwe have a corpus of text where documents are divided into (say 2) distinct classes. We assume that the corpus is generated by a multinomial mixture model $\pi(H) = \sum\limits_{k = 1}^K c_k P(H | t_k)$ where $P(H|t_k)$ is multinomial. If we have two classes, we will have two multinomial distributions whose vectors $t_k$ denote the occurrence of words in each topic. We are implicitly assuming that each document consists of one topic.

Aside: Information Theory

Given a random variable $X$ with some distribution $P$, the distribution $P$ describes what we know before we observe $X$. We want to be able to quantify how much information we gain by actually observing $X$. This is the information content of $X$. If there's no randomness in the source at all (the distribution always produces the same value), then we gain nothing by observing $X$. More generally, if the probability of seeing a particular outcome is large, there's no big surprise to seeing that outcome. However, if the probability of some outcome is small and we observe it, we gain information. We say the information in observing $X = x$ under $P$ is $J_P(x) = -\log(P(x))$. In this way, the information gain from two unrelated observations is additive. We would like to know the expected information gain of a draw from the distribution, which we denote and compute as $\mathbb{H}[X] = \mathbb{E}_P[J_P(X)]$, called the entropy of $X$. We can compute the entropy by $$ \mathbb{E}[J_P(X)] = -\mathbb{E}_P[\log(P(X))] = -\sum_{x \in X} P(x) \log (P(X)) $$ We can write $\mathbb{H}$ as a function of $P$ or $X$. The entropy is always non-negative; it will be $0$ when a single term has probability $1$ - so we know beforehand exactly what is going to happen. If $P$ is a uniform distribution such that $X$ takes on one of $d$ values, then $\mathbb{H}$ is maximized. In general, a higher dimension means that sampling the distribution is less informative.

Alternatively, we can define entropy by defining some measure $\mathcal{H}[X]$ or $\mathcal{H}(P)$ of the information of $X$ which satisfies some properties that we want it to have. We require that if $X$ and $Y$ are independent, then their information content is disjoint, so that $\mathcal{H}[X, Y] = \mathcal{H}[X] + \mathcal{H}[Y]$. More generally, we should be able to remove the joint information in $X$ by conditioning on $Y$. That is, we should have $\mathcal{H}[X, Y] = \mathcal{H}[X] + \mathcal{H}[Y | X]$. To keep the measure “nice”, we would like it to have continuity in the distribution, so that if we make a small change in $P$ then $\mathcal{H}(P)$ will not jump. Finally, it should be monotonic, so that if we (for example) increase the number of possible outcomes the information gained by sampling should be higher. If $U_d$ is the uniform distribution in $d$ dimensions, we must have $\mathcal{H}(U_d) < \mathcal{H}(U_{d + 1})$. If you combine these axioms, then $\mathcal{H}(P) = c\mathbb{H}(P)$. That is, any measure which satisfies these requirements is information gain. In general, most “reasonable” axioms result in an entropy measure which is the same up to scaling.

We can apply this idea to coding - where we take data and compress it, without throwing away any of the data. A simple way to encode text would be to measure the frequency with which each word occurs and assign short codes to common words and long codes to less common words. This is in contrast to encoding each word simply according to its letters (which scales according to length). This idea is called Huffman coding, and is optimal if all we can do is replace words by codewords. If we know the distribution of words in texts, we can use information theory to compute the expected compression rate and check whether our encoder achieves this optimal rate. If our distribution over words is $P$, and we obtain a sequence $X^n = (X_1, \ldots, X_n)$ of random observations sampled iid from $P$, then for any $\epsilon$ there is some encoder for which $$ \mathbb{E}\left[\frac{1}{n} \mathrm{length(encoding(}X^n))\right] \le H(P) + \epsilon $$ This essentially says that the entropy is a lower bound for lossless compression. This means that on average we can encode $X^n$ without loss of information using $nH(P)$ bits of information. This is becomes true as $n \rightarrow \infty$, for small $n$ we may achieve a better result with some encoder but the probability that we will lose some information with that encoder approaches $1$ as $n$ grows. For very small word lengths (e.g. characters), we can get close to the entropy by just encoding trivially. When coding words instead, there are more words that are extremely common; the occurrence of words is well-modeled by a Zipf distribution. This means that the entropy is very low, so we can encode it very well.

The Kulback-Leibler divergence gives us a notion of “how far” two distributions are from one another, although it's not strictly a distance. It uses information to compare two distributions. Heuristically, we can say we want to compare two distributions $P$ and $Q$ on $X$. We are interested in measuring the amount of information we gain in terms of $Q$ by taking a random sample from $P$. This is computed by $\mathbb{E}_P[J_Q(X)]$. However, we want this measure of difference to be zero when $P = Q$, but $\mathbb{E}_P[J_Q(X)] = \mathbb{H}(P)$ when $P = Q$. So, we normalize by subtracting $\mathbb{H}(P)$, giving $$ D_{KL}(P \| Q) = \mathbb{E}_P[J_Q(X)] - \mathbb{H}(P) = \mathbb{E}_P[J_Q(X) - J_P(X)] = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $$ the Kulback-Leibler divergence. This quantity is always positive and when $D_{KL}(P \| Q) = 0$ it follows that $P = Q$. The KL divergence is not symmetric; $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$ in general, and it does not satisfy a triangle inequality. However, $D_{KL}(P \| Q)$ is convex in the pair $(P, Q)$.

We can condition the entropy on some observations $X$ by computing the entropy using the expectation conditioned on $P(x)$, giving $$ \mathbb{H}[Y | X] = -\sum_{x, y \in X} P(x, y) \log( P(y | x) ) $$ We can alternatively ask how much observing $X$ tells us about $Y$. When they share no information, then they are independent and $P(x, y) = P(x)P(y)$. We know how to compute the difference between two distributions using the KL-divergence, so we can just compare the actual joint distribution $P(x, y)$ with the independent case $P(x)P(y)$. This gives us $$ I[X, Y] = \mathbb{H}[X, Y] - \mathbb{H}[Y | X] $$ the mutual information of $X$ and $Y$. Note that conditioning reduces entropy because $\mathbb{H}[X, Y] = \mathbb{H}[Y | X] + \mathbb{H}[x]$. We can write out the mutual information using KL divergence, giving $$ I[X, Y] = D_{KL} [P(x, y)\|P(x)P(y)] = \sum_{x \in X} P(x, y) \log \left( \frac{P(x, y)}{P(x)P(y)} \right) $$ From this, we get the property that the mutual information is $0$ if and only if $X$ and $Y$ are independent. To estimate these quantities from data, you need very large amounts of data.

We can also compute these quantities for continuous probability density functions. For entropy, we get the differential entropy $$ \mathbb{H}[X] = -\int_X p(x) \log(p(x)) dx $$ Since $p$ is a density, it can be below or above $1$, so this integral can become negative, which is why we call it differential. The mutual information becomes $$ I[X, Y] = \int_X p(x, y) \log\left(\frac{p(x, y)}{p(x)p(y)} \right) d x $$ and the KL divergence becomes $$ D_{KL}(p \| q) = \int_X p(x) \log\left(\frac{p(x)}{q(x)} \right) d x $$ These quantities are still non-negative, and still have the properties described above.

A useful application of these tools is to choose a model (set of distributions) and find the distribution in the model which minimizes the KL-divergence between the empirical distribution and the distributions in our model, formulated as $$ \hat{\theta} = \mathrm{arg}\min_{\theta \in \mathcal{T}} D_{KL} (\mathbb{F}_n | p(x | \theta) ) $$ where $p(x | \theta)$ is a distribution from our model $\mathcal{P} = \{p(x | \theta) | \theta \in \mathcal{T} \}$ and $\mathbb{F}_n$ is the empirical distribution. This exactly gives us the maximum likelihood estimator.

We can also apply these methods as a modeling approach, where we choose a distribution out of a set of admissible distributions if it maximizes the entropy. This chooses the distribution which relies on the fewest assumptions. We then choose our model distribution as $P = \mathrm{arg}\max_{Q \in \mathcal{P}} \mathbb{H}(Q)$. For example, if we set $\mathcal{P}$ to be the set of distributions with a certain variance, then maximum entropy estimation will choose the Gaussian distribution. If we formulate our constraints as functions $S_1, \ldots, S_m$ from our sample space to $\mathbb{R}$, then the constrained set is $\mathcal{P} = \{ Q | \mathbb{E}_Q[S_1(X)] = s_1, \ldots, \mathbb{E}_Q[S_m(X)] = s_m \}$. If we include these constraints into our maximum entropy optimization using Lagrange multipliers, then we get an exponential family model.

In traditional statistics, we often measure the variance, covariance, and $\chi^2$; these are low-order approximations to entropy, mutual information, and KL-divergence. So, we can often derive methods by substituting statistical quantities with information theory ones.

Bayesian Models

Bayesian statistics assumes that not only is data randomly sampled from some distribution $P$, but $P$ itself is a random quantity which is sampled from $Q$, or formally, $P \sim Q$ and $X_1, X_2, \ldots \sim_{iid} P$. Bayesian statistics argues that we should treat any uncertainty using probability distributions, and since we don't explicitly know $P$, we also treat it as sampled from some distribution. In a more vague sense, we can view the parameters of our model as encapsulating the pattern in some data, and we can view the goal of statistics to extract this pattern. In Bayesian statistics, the pattern itself is modeled as a random quantity. The randomness in $Q$ reflects our lack of knowledge about $P$. The distribution $Q$ is the a priori or prior distribution. We'd like determine the probability of $P$ given our data, which we express as $\Pi[P|x_1, \ldots, x_n]$, called the a posteriori or posterior distribution. In general the posterior is impossible to compute, but when the model is parametric it can be obtained using Bayes equation. If $P$ is an element of a parametric model so that its density $p$ is in the family $\mathcal{P} = \{p(x | \theta) | \theta \in \mathcal{T}\}$ then we can express the prior and posterior as distributions on $\mathcal{T}$ by $q(\theta)$ and $\Pi(\theta | x_1, \ldots x_n)$. Sampling then is of the form $\theta \sim q$ and $X_1, X_2, \ldots \sim_{iid} p(\cdot | \theta)$. We can compute the posterior under the data $X_1, \ldots, X_n$ by $$ \Pi(\theta | x_1, \ldots, x_n) = \frac{\left(\prod_{i = 1}^n p(x_i | \theta) \right)q(\theta)}{p(x_1, \ldots, x_n)} = \frac{\left(\prod_{i = 1}^n p(x_i | \theta) \right)q(\theta)}{\int_\mathcal{T} \left(\prod_{i = 1}^n p(x_i | \theta) \right)q(\theta) d\theta} $$ where $q(\theta)$ is called the prior, $\prod_{i = 1}^n p(x_i | \theta)$ is called the likelihood, and $p(x_1, \ldots, x_n)$ is called the evidence.

Example: Gaussian-mean Gaussian

We assume that our data is sampled from a Gaussian with fixed variance $\sigma^2$ but with unknown mean and we assume that the data mean is sampled from a Gaussian distribution. Formally, if $g$ is a Gaussian density, then $p(x | \theta, \sigma) = g(x | \mu, \sigma)$ and $q(\mu) = g(\mu | \mu_0 , \sigma_0)$. The resulting posterior $\Pi(\mu | x_{1:n})$ is itself a Gaussian, with mean $$ \frac{\sigma^2\mu_0 + \sigma^2_0 \sum\limits_{i = 1}^n x_i}{\sigma^2 + n\sigma_0^2} $$ and standard deviation $\frac{\sigma \sigma_0^2}{\sigma^2 + n\sigma_0^2}$. As more datapoints are obtained, the posterior distribution moves its mean closer to the true data mean and its standard deviation decreases.

MAP Estimation

If $\Pi(\theta | x_{1:n})$ is the posterior of a Bayesian model, then the maximum a posteriori (MAP) estimator $\hat{\theta}_{ML}$ is the parameter setting which maximizes the posterior, or explicitly $\hat{\theta}_{ML} = \mathrm{arg}\max_{\theta \in \mathcal{T}} \Pi(\theta | x_{1:n})$. MAP estimation is used when we would like to obtain a single parameter setting which is optimal in some sense, as is done e.g. in Maximum Likelihood estimation. Bayesian statistics produces a posterior instead of a point estimate; MAP estimation retrieves a point estimate from the posterior. Taking the logarithm does not affect the location of the maximum, so we can instead compute $\hat{\theta}_{MAP} = \mathrm{arg}\max_{\theta \in \mathcal{T}} \log (\Pi(\theta | x_{1 : n})$. Plugging in Bayes equation gives $$ \hat{\theta}_{MAP} = \mathrm{arg}\max_{\theta \in \mathcal{T}} \sum_{i = 1}^n \log(p(x_i | \theta)) + \log(q(\theta)) - \log(p(x_1, \ldots, x_n) = \mathrm{arg}\max_{\theta \in \mathcal{T}} \sum_{i = 1}^n \log(p(x_i | \theta)) + \log(q(\theta)) $$ where the simplification was made because log-evidence does not depend on $\theta$. We can therefore view MAP estimation as regularized ML estimation, where the regularization term $\log(q(\theta))$ causes MAP estimation to prefer values of $\theta$ which have high probability according to the prior.


The prior is typically chosen as an element of a standard parametric family $\mathcal{Q} = \{q(\theta | \phi) |\phi \in \mathcal{H}\}$ where $\mathcal{H}$ are the space of “hyperparameters” $\phi$ of the model. When the parametric model $\mathcal{P}$ of $p$ is an exponential family model such that $p(x | \theta) = \frac{h(x)}{Z(\theta)}e^{\langle S(x) | \theta \rangle}$ and the prior distribution is of the exponential family form $$ q(\theta | \lambda, y) = \frac{\mathrm{exp}\left(\langle \theta | y \rangle - \lambda \log(Z(\theta))\right)}{K(\lambda, y)} $$ (called the natural conjugate prior) where $\lambda \in \mathbb{R}^+$ and $y \in \mathcal{T}$ are the hyperparameters then we can use Bayes equation to compute \begin{align*} \Pi(\theta | x_1, \ldots, x_n) &= \frac{\prod_{i = 1}^n p(x_i | \theta)}{p(x_1, \ldots, x_n)} q(\theta)\\ &= \frac{\prod_{i = 1}^n \frac{h(x_i)}{Z(\theta)}e^{\langle S(x_i) | \theta \rangle}}{p(x_1, \ldots, x_n)} \frac{\mathrm{exp}\left(\langle \theta | y \rangle - \lambda \log(Z(\theta))\right)}{K(\lambda, y)}\\ &= \frac{ \frac{\prod_{i = 1}^n h(x_i)}{Z(\theta)^n} \mathrm{exp}(\left\langle \sum_{i = 1}^n S(x_i) | \theta \right\rangle)}{p(x_1, \ldots, x_n)} \frac{\mathrm{exp}\left(\langle \theta | y \rangle - \lambda \log(Z(\theta))\right)}{K(\lambda, y)}\\ &\propto \frac{\mathrm{exp}(\left\langle \sum_{i = 1}^n S(x_i) | \theta \right\rangle)} {Z(\theta)^n} \mathrm{exp}\left(\langle \theta | y \rangle - \lambda \log(Z(\theta))\right)\\ &= \frac{\mathrm{exp}(\left\langle y + \sum_{i = 1}^n S(x_i) | \theta \right\rangle)}{Z(\theta)^{\lambda + n}}\\ &= \mathrm{exp}\left(\left\langle y + \sum_{i = 1}^n S(x_i) | \theta \right\rangle -(\lambda + n)\log(Z(\theta)) \right)\\ &\propto q\left(\theta | \lambda + n, y + \sum_{i = 1}^n S(x_i) \right) \end{align*} where the first proportionality simplification was made by ignoring terms which do not depend on $\theta$. The final proportionality is actually an equality because the distribution must be normalized. In summary, if $\mathcal{P}$ is an exponential family model and $q(\theta | \lambda, y)$ is the natural conjugate prior defined above, then we can compute the posterior by updating the hyperparameters. In general, we can say that $\mathcal{P}$ and $\mathcal{Q}$ are conjugate when for each sample size $n \in \mathbb{N}$ there exists a function $T_n : X^n \times \mathcal{H} \rightarrow \mathcal{H}$ such that $\Pi(\theta | x_1, \ldots, x_n) = q(\theta | T_n(x_1, \ldots, x_n, \theta))$. More generally, if the posterior is a member of the product family for some hyperparameter setting (not necessarily computed by some function $T_n$), we call the model closed under sampling. In practice, the only parametric models which admit conjugate priors are exponential family models. As derived above, in the specific case of the exponential family $\mathcal{P}$ with natural conjugate family $\mathcal{Q}$, the posterior is computed with the hyperparameter update $$ T_n( x_1, \ldots, x_n, \lambda, y) = \left(\lambda + n, y + \sum_{i = 1}^n S(x_i) \right) $$ The expected value of the natural conjugate prior is $y$, and the parameter $\lambda$ a concentration, such that large values of $\lambda$ indicate a distribution which peaks sharply around $y$. In this way, when we update the hyperparameters we are linearly interpolating our prior guess with the sufficient statistic of the seen data and we are also increasing the prior concentration by $n$, so more data means more concentration.

Hierarchical Models

We can regard a Bayesian model with prior $q$ as a decomposition of the underlying data distribution $p(x)$ into $p(x_{1 : n}) = \int_\mathcal{T} \int_i p(x_i|\theta)q(\theta) d\theta$. We can perform a further decomposition on the prior to obtain $q(\theta) = \int q(\theta | \phi) \tilde{q}(\phi) d \phi$, so that hyperparameters $\phi$ are sampled from $\tilde{q}$, parameters are sampled from $q(\cdot | \phi)$, and the data is modeled as being sampled from $p(\cdot | \theta)$. We can continue on this way, splitting each distribution recursively. All of the distributions which have been split out combine together to create the single prior $p$. This is useful when the intermediate distributions and their parameters have a well-defined meaning. We can then use well-known models as building blocks to a larger hierarchical model. Inference is made more difficult as we add layers, and can be done using Markov chain sampling.

Sequential Models

Many data sources are sequential. One example are “simple random walks” on graphs - sequences of vertices which start at a vertex $X_1$ at random, choose a vertex $V_2$ from the neighbors of $V_1$ at random, and continue on in this fashion choosing $X_n$ from the neighbors of $X_{n - 1}$. Such sequences are not i.i.d. - each value depends on the previous value. Now, we will consider sequences where $X_n$ is dependent on the samples already observed, that is $P(X_n) = P(X_n | X_1, \ldots, X_{n - 1})$. Many real-world problems have data which is inherently sequential, e.g. speech recognition, handwriting recognition, time series, simulation algorithms (MCMC), random walk models, etc.

Markov Models

A Markov model makes the assumption that we don't need to look at all samples observed so far, but instead only the previous $r$ steps so that $P(X_n | X_{n - 1}, \ldots, X_1 ) = P(X_n, | X_{n - 1}, \ldots, X_{n - r})$. This is called a Markov chain of order $r$. A Markov chain is the simple case where $r = 1$. The first state in the distribution is special because it doesn't have past, and is denoted $X_0$. This is an explicit assumption that $X_n$ is independent of $X_{n - r - 1}$ and all samples which came before it. Markov models can be represented as graphs, where nodes are the possible values of the random variable, and edges denote a probability of observing some value given another, written $p_{s \rightarrow t} = \mathrm{Pr}\{X_n = t | X_{n - 1} = s\}$. The possible values of the random variable are the state space of the Markov chain, and each possible value is a state. The number of states can be infinite but is usually finite in practice. You can still represent an i.i.d. random variable in this way, as long as the incoming edges for each state have the same weight/probability. The probabilities on the edges must not change over time. This type of Markov chain is called stationary (the probabilities do not change with $n$). We can represent the probability of a transition from one state to another using a transition matrix $\mathbf{p}_{i, j} = p_{i \rightarrow j}$. We also need to specify the probability of the starting state, which we can represent as a vector $P_{init} = [Pr\{X_0 = 1\}, \ldots, Pr\{X_0 = d\}]$. This provides a complete description of a stationary Markov chain. We can write the simple random walk described above as a Markov chain, with $P_{init} = [1/d, \ldots, 1/d]$ and $p_{i \rightarrow j} = 1/N_i$ where $N_i$ is the number of edges out of node $i$. We can generalize the simple random walk by substituting the uniform distributions with arbitrary distributions. The concept of a random walk on a weighted graph and a Markov chain on a finite state space are equivalent.

Page ranking algorithms seek to order the pages by importance or usefulness which match a certain search query. This step is carried out after the list of pages matching the query has been found (which can be a huge number of pages). We can model the web as a large graph $G$, where nodes are web pages and edges are hyperlinks between different web pages. A simple score would be to compute a pages “popularity” by computing the number of edges which connect to (link to) a given vertex (page). Google's PageRank assumes that the popularity of a page is proportional to $P_n(x) = \mathrm{Pr}\{X_n = n\}$ where $X_1, \ldots, X_n$ is a simple random walk (by a “random surfer”) on $G$. This model makes two assumptions: Better pages are pages linked more often, and that a link from a high-quality page is more important than a link from a low-quality page (because you are taking account of the quality of pages linking to a page). If we know the initial state $X_0$ is $s_0$, then the probability of $X_1 = s_1$ is $p_{s_0 \rightarrow s_1}$. We can find the probability $P_1$ that $X_1$ is $s_1$ if we do not know the starting state by summing over all possible start states and computing \begin{align*} P_1(s_1) &= \mathrm{Pr}\{X_1 = s_1\}\\ &= \sum_{s_0 \in X} \mathrm{Pr}\{X_1 = s_1 | X_0 = s_0\} P_{init}(s_0)\\ &= \sum_{s_0 \in X} p_{s_0 \rightarrow s_1} P_{init}(s_0)\\ &= \mathbf{p}P_{init} \end{align*} For an arbitrary further step $n$, we can generalize this quantity and compute $P_n = \mathbf{p}^n P_{init}$. Really, we'd like to see what happens when $n$ is really large. So, we can take the limit $P_\infty = \lim_{n \rightarrow \infty} P_n = \lim_{n \rightarrow \infty} \mathbf{p}^n P_{init}$ provided that this limit exists. If the limit does exist, then $\mathbf{p}P_\infty = \mathbf{p}\lim_{n \rightarrow \infty} P_n = P_\infty$. We therefore seek a distribution $P$ on $X$ such that $\mathbf{p}P = P$, called the equilibrium distribution. This distribution may not be unique and not exist depending on the structure of the graph. When the distribution is not unique, then we are effectively saying that we can get “trapped” in part of the graph. So, if we require that there is a path in the graph from every state to every other state (called an irreducible Markov chain), then the equilibrium distribution is unique. If the sequence of states will never converge, then the equilibrium distribution will not exist. We can enforce this by requiring that the probability of staying in a given state is greater than zero, which we call an aperiodic Markov chain. This results in a transition matrix with non-zero diagonal. It is in fact provable that if a stationary Markov chain $(\mathbf{p}, P_{init})$ has $p_{s_1 \rightarrow s_2} > 0$ for any two states $s_1, s_2 \in X$ then $P_\infty$ exists, is unique, and is the equilibrium distribution. We can approximate the equilibrium distribution by multiplying $P_{init}$ by $\mathbf{p}$ many many times, computing $P_{n+1} = \mathbf{p}P_n$ until $\|P_{n + 1} - P_n\| < \tau$. This is called the Power method, and is basically an eigenvector computation, because the criteria $P = \mathbf{p}P$ means that $P = P_\infty$ is an eigenvector of $\mathbf{p}$ with eigenvalue $1$. For a Markov chain, if $\mathbf{p}$ is irreducible and aperiodic then $1$ is the largest eigenvalue, and we can find the largest eigenvalue by repeatedly multiplying by $\mathbf{p}$.

In practice, the web graph may not be irreducible and aperiodic because there is not a link from every page to every other page. So, if we construct the transition matrix $A$ by $A_{i, j} = 1/N_i$ where $N_i$ is the number of links out of page $i$, then we can set $\mathbf{p} = (1 - \alpha)A + \frac{\alpha}{d} \mathbf{1}$ where $\mathbf{1}$ is the matrix of ones for some small $\alpha$. The yields a regularized probability transition matrix which describes a Markov chain which is irreducible and aperiodic. Also in practice, the web changes, so we compute the equilibrium distribution using the power method using an initial distribution of the previous equilibrium distribution. Finally, if we assume that the random surfer is more likely to start on a popular page than an unpopular one, then we can sample $X_0 \sim P_{eqn}$ where $P_{eqn}$ is the equilibrium distribution of the chain. However, note that after following $n$ links the distribution is still $P_{eqn}$ because $P_n = \mathbf{p}^n P_{eqn} = P_{eqn}$. So, $P_{eqn}$ is a good PageRank solution - we don't have to worry about $n$.

Hidden Markov Models

In a Markov model, we assume that a given sample is only dependent on the sample before it. Hidden Markov Models provide a way to model long-range dependencies in a tractable way. HMMs use a Markov chain to generate a sequence of latent (or “hidden”) variables. These models can then generate sequences of observations with long-range dependencies, even though the hidden variables are Markovian. We call the Markov chain $(Q_{init}, \mathbf{q})$ where $Q_{init}$ is the initial state distribution and $\mathbf{q}$ is the transition matrix with states $\{1, \ldots, K\}$ as well as an emission distribution $P(x | c)$. To generate a sequence, the model generates samples $Z_1, Z_2, \ldots$ from the Markov chain, then sample $X_1, X_2, \ldots$ by independently sampling from $P(\cdot | Z_i)$. Each output sample in the sequence is therefore only dependent on a single sample from the hidden sequence. In a continuous HMM, the variables $X_i$ have continuous distributions and $P$ will take the form of a density $p(x | z)$.

Example: Dice

Consider a game involving two dice, one which is fair, and one which is loaded. At each roll, you can either keep the current dice or switch to the other dice with a certain probability. We then roll the newly chosen dice. The observed values are the numbers we record from the dice, so the sample space is $X = \{1, \ldots, 6\}$. The state is whether we are rolling the fair dice or the loaded dice, so $Z_n \in \{fair, loaded\}$. We are not able to observe which dice we are rolling at a given point in time. The transition matrix is given by $\mathbf{q} = [.95, .05;, .1, .9]$. Finally, the emission probabilities are all $\frac{1}{6}$ for the fair dice and are $\frac{5}{10}$ for a 6 on the loaded dice and $\frac{1}{10}$ for all other outcomes.

HMMs can be applied in problems of various types. If we know the transition matrix $\mathbf{q}$, the emission distribution $P(\cdot | z)$, and the observed sequence $x_{1:N} = (x_1, \ldots, x_N)$, we can attempt to estimate the probability of each hidden variable $Q(Z_i = k | x_{1 : n})$. This is called a filtering problem; a smoothing problem uses all observations $x_{1:N}$. In a decoding problem, we know $\mathbf{q}$ and $P(\cdot | z)$ and the observed sequence $x_{1:N}$, and we seek the maximum likelihood estimate of the hidden state sequence $\hat{z}_{1:N}$. In the learning case, we only see the observed sequence and we seek to find the entire model ($\mathbf{q}$ and $P(\cdot | z)$). In all cases, you are formulating estimations for each time step and chaining them together using the Markov assumption.


In filtering, we know the transition matrix $\mathbf{q}$, the emission distribution $P(\cdot | z)$, and the observed sequence $x_{1:N} = (x_1, \ldots, x_N)$ and we seek to determine $Q(Z_n = k|x_{1:n})$. We denote the desired matrix $\hat{Q}_{nk} = Q(Z_n = k | x_{1 : n})$. Using Bayes' equation gives $$ Q(z_n | x_{1 : n}) = Q(z_n | x_n, x_{1:(n - 1)}) = \frac{P(x_n|z_n, x_{1:(n - 1)} )Q(z_n | x_{1:(n - 1)})}{\sum_{z_n = 1}^K P(x_n|z_n, x_{1:(n - 1)}) Q(z_n|x_{1:(n - 1)})} $$ The important term is $Q(z_n | x_{1:(n - 1)})$. We can compute this term by reducing to the previous step, by decomposing $$ Q(Z_n = k | x_{1 : (n - 1)}) = \sum_{i = 1}^K Q(Z_n = k | Z_{n - 1} = l) Q(Z_{n - 1} = l|x_{1 : (n - 1)}) $$ The first term in the summation is just the entry $\mathbf{q}_{lk}$ in the transition matrix and the second term is just $\hat{Q}_{(n - 1)l}$. So, we can compute the numerator in the Bayes equation by $$ a_{nk} = P(x_n | z_n) \sum_{i = 1}^K \mathbf{q}_{lk} \hat{Q}_{(n - 1)l} $$ and we can set $$ \hat{Q}_{nk} = \frac{a_{nk}}{\sum_{j = 1}^K a_{nj}} $$ This algorithm is called the forward algorithm.


We can avoid making an assumption on the form of $\mathbf{q}$ by learning it from the data. We do have a model assumption for $P(\cdot | z)$, but we still need to estimate its parameters. If each state is defined by a parameter value $\theta_k$, then this setting is very similar to a mixture model. If you focus on a single state and its observation, then $Z_n$ is a variable which takes one of $K$ values; this selects a model setting which is used to generate $x_n$. This is exactly a mixture model as we have seem in clustering - so that an HMM is a chain of mixture models. In the HMM, we estimate the transition probabilities and state probabilities in the E-step, and estimate the component parameters in the M step. We must also estimate the transition matrix $\mathbf{q}$ of the Markov chain. In the M step we also do weighted maximum likelihood, except we use state probabilities in place of assignment probabilities. In the E step, we have to compute a new state probability, conditioned on the parameters from the previous step. Estimating the transition probabilities is like the filtering problem, except that it is solved recursively.

Example: Speech Recognition

In the speech recognition problem, you are given a sound signal which is represented as a sequence of feature vectors (e.g. MFCCs) and seek to determine the words which have been spoken. Each word is broken down into phonemes, and the states in the EM correspond to each phoneme. The sequence of feature vectors is the observed sequence. The HMM must first be trained using some labeled data, then we must solve a filtering problem to actually translate some query speech. To adapt to different speakers, the emission probabilities can be updated by making small changes in the covariance matrices of the emission probability matrices.

stat_w4400_course_notes.txt · Last modified: 2015/12/17 21:59 (external edit)