# http://colinraffel.com/wiki/

### Site Tools

elen_e6892_course_notes

# ELEN E6892 Course Notes

## Probability Basics

Given a space $\Omega$ containing points partitioned into different regions $A$ and $B$, the probability that a point drawn at random comes from $A$ can be computed by the number of points in $A$ divided by the total number of points in $\Omega$. The probability that a point falls in $A$ given that it falls in $B$ is the number of points that fall in both $A$ and $B$ divided by the total number of points in $B$ or $P(x \in A | x \in B) = P(x \in A, x \in B)/P(x \in B)$. It follows that $P(A, B) = P(A|B)P(B)$ for two events $A$ and $B$ (the product rule). By symmetry, we have $P(B, A) = P(A, B) = P(B | A)P(A)$ which leads to Bayes' rule $$P(A | B) = \frac{P(B | A)P(A)}{P(B)}$$ The term $P(B)$ is essentially a normalizer because $P(A|B)$ is a function of $A$, so we can instead simply write $P(A|B) \propto P(B | A)P(A)$.

$P(A|B)$ is the conditional probability - the probability that $A$ will happen given that $B$ has happened. The joint probability $P(A, B)$ is the probability that both $A$ and $B$ will happen. $P(B)$ can either be called the probability of $B$ or the marginal (or prior) probability of $B$, where the latter indicates that there's some other event that we are marginalizing out. In general, the marginal probability is a way to compute the probability that an event will happen by summing over the probabilities that it and every other event will happen, so that $P(X) = \sum_Y P(X, Y)$ (the sum rule).

##### Example

Let $a$ be a Bernoulli random variable which is $1$ when some person has some disease and $0$ when they don't, and let $b$ be a Bernoulli random variable which is $1$ when they test positive for the disease and $0$ when they don't. Let $P(b = 1 | a = 1) = P(b = 0 | a = 0) = 0.95$, and let $P(a = 1) = 0.01$. Using Bayes' rule, we have $P(a = 1|b = 1) = \frac{P(b = 1 | a = 1)P(a = 1)}{P(b = 1)} = 0.16$. Essentially, the low probability that $a = 1$ corrects the estimate and makes it lower.

#### Continuous Random Variables

Now, consider the case where the values of a random variable can be continuously-valued. To call a function in some space $\Omega$ a probability density, we require that $p(x) \ge 0\; \forall x \in \Omega$ and that $\int_\Omega p(x) d x = 1$ and $p(A) = \int_A p(x) d x$. Similarly to the discrete random variable case, we have $p(x | y) = p(x, y)/p(y)$ and $p(x | y)p(y) = p(x, y) = p(y | x)p(x)$ and $$p(x | y) = \frac{p(y | x)p(x)}{p(y)} = \frac{p(y | x)p(x)}{\int_{\Omega_x} p(y | x)p(x) dx}$$

#### Expectation, Variance and Covariance

The expectation of some function $f(x)$ under a probability distribution $p(x)$ is computed by $\mathbb{E}[f] = \sum_x P(x)f(x)$ for discrete distributions and $\mathbb{E}[f] = \int p(x)f(x) dx$ for probability densities of continuous $x$. It can be estimated empirically by $$\mathbb{E}[f] \approx \frac{1}{N}\sum_{n = 1}^N f(x_n)$$ The variance of $f(x)$ can be computed as $\mathrm{Var}[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2]$. For two random variables $x$ and $y$ we can compute the covariance as $\mathrm{cov}[x, y] = \mathbb{E}_{x, y}[(x - \mathbb{E}[x])(y^\top - \mathbb{E}[y^\top])]$.

In the special case where $f(x) = x$ we have $\mathbb{E}[f] = \int p(x)x dx$ and \begin{align*} \mathrm{Var}[x] &= \int (x - \mathbb{E}[x])^2 p(x) dx\\ &= \int x^2 p(x) dx - 2\mathbb{E}[x] \int x p(x) dx + \mathbb{E}[x]^2\\ &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \end{align*}

### Models

The general modeling assumption is that the data $y$ is drawn from some distribution parameterized by $x$, and that the model parameters themselves are drawn from some other distribution so that $y \sim p(y | x)$ and $x \sim p(x)$. In this way, we have $p(x | y) \propto p(y | x)p(x)$. When constructing our model, we make explicit assumptions about $p(y | x)p(x)$ but do not know $p(x | y)$ (the probability of a parameter value given the data); we seek to be able to make decisions about parameters given our data.

##### Example: Beta-Bernoulli

Let $X$ be a Bernoulli random variable with $P(X = 1) = \pi$ and $P(X = 0) = 1 - \pi$. We can then write that $P(X = x | \pi) = \pi^x (1 - \pi)^{1 - x}$. Now, we sample $X$ a number of times to obtain $X_i \sim_{iid} P(X | \pi)$. We can then compute \begin{align*} P(X_1, \ldots, X_N | \pi) &= \prod_{i = 1}^N P(X_i | \pi)\\ &= \prod_{i = 1}^N \pi^{X_i} (1 - \pi)^{1 - X_i}\\ &= \pi^{\sum X_i} (1 - \pi)^{N - \sum X_i} \end{align*} We can search for the value of $\pi$ which maximizes this quantity by computing $$\pi_{ML} = \mathrm{\arg}\max_\pi p(X_{1:N} | \pi) = \frac{1}{N} \sum_i^N X_i$$ However, we become more confident in our estimate of $\pi$ as $N$ increases. Using Bayes' rule, we can say that $P(\pi | X_{1 : N}) \propto P(X_{1 : N} | \pi)p(\pi)$. This will give a posterior probability distribution on $\pi$. If we don't have any idea of what $\pi$ should be, we can set $p(\pi)$ to be the uniform distribution over $[0, 1]$. Then we have $$P(\pi | X_{1 : N}) \propto \pi^{(\sum x_i + 1) - 1} (1 - \pi)^{(N - \sum X_i + 1) - 1} p(\pi)$$ The Beta distribution is defined as $$\mathrm{Beta}(a, b) = p(\pi) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \pi^{a - 1}(1 - \pi)^{b - 1}$$ Note that if we set $a = 1, b = 1$ then the Beta distribution is the uniform distribution. So, more generally if we have $X_i \sim_{iid} \mathrm{Bernoulli}(\pi), i = 1, \ldots, N$ and we define a prior on $\pi$ as $\pi \sim \mathrm{Beta}(a, b)$ then we have \begin{align*} P(\pi | X_{1:N}) \propto p(X_{1 : N} | \pi)p(\pi) &= \left(\pi^{\sum X_i} (1 - \pi)^{N - \sum X_i}\right) \left(\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \pi^{a - 1} (1 - \pi)^{b - 1}\right)\\ &\propto \pi^{\sum X_i + a - 1} (1 - \pi)^{N - \sum X_i + b - 1}\\ &= \mathrm{Beta}(a + \sum X_i, b + N - \sum X_i) \end{align*} A property of Beta distributions is that $\mathbb{E}[\pi | X_{1:N}] = \frac{a + \sum x_i}{a + b + N}$. So, as $N$ increases, the terms $a$ and $b$ become negligible and so the expectation converges on the empirical estimate. The variance is then $$\mathrm{Var}(\pi | X_{1:N}) = \frac{(a + \sum x_i)(b + N - \sum X_i)}{(a + b + N)^2 (a + b + N - 1)}$$ which goes down like some constant divided by $N$. So the variance decreases as $N$ increases. The posterior therefore captures the uncertainty in our estimate of $\pi$.

### Exponential family distributions

Most models of interest can be written in exponential family form, where $p(x | \eta) = h(x) \mathrm{exp}(\eta^\top t(x) - A(\eta))$. Each term has a name; $\eta$ is the natural parameter, $t(x)$ is the sufficient statistic, and $A(\eta)$ is the log-normalizer which ensures the whole thing sums to 1. We can compute the expectation by \begin{align*} \int p(x | \eta)dx &= 1\\ \rightarrow \nabla_\eta \int p(x | \eta)dx &= 0\\ \rightarrow \int (t(x) - \nabla_\eta A(\eta))p(x | \eta)dx &= 0\\ \rightarrow \int t(x)p(x | \eta)dx - \nabla_\eta A(\eta) \int p(x | \eta)dx &= 0\\ \rightarrow \mathbb{E}[t(x)] - \nabla_\eta A(\eta) \int p(x | \eta)dx &= 0\\ \rightarrow \mathbb{E}[t(x)] - \nabla_\eta A(\eta) \int p(x | \eta)dx &= 0\\ \rightarrow \mathbb{E}[t(x)] - \nabla_\eta A(\eta) &= 0\\ \rightarrow \mathbb{E}[t(x)] &= \nabla_\eta A(\eta) \end{align*} where the last simplification was made based on the fact that $\int p(x | \eta)dx = 1$. Similarly, we can compute \begin{align*} \nabla^2_\eta \int p(x | \eta)dx &= 0\\ \rightarrow -\nabla_\eta^2 A(\eta)\int p(x | \eta)dx - \int (t(x) - \nabla_\eta A(\eta))^2p(x | \eta)dx &= 0\\ \rightarrow -\nabla_\eta^2 A(\eta) - \int (t(x) - \nabla_\eta A(\eta))^2p(x | \eta)dx &= 0\\ \rightarrow \mathbb{E}[t(x)t(x)^\top] - \mathbb{E}[t(x)]\nabla_\eta A(\eta)^\top - \nabla_\eta A(\eta)\mathbb{E}[t(x)]^\top + \nabla_\eta A(\eta) \nabla_\eta A(\eta)^\top &= \nabla^2 A(\eta)\\ \rightarrow \mathbb{E}[t(x)t(x)^\top] - \nabla_eta A(\eta) \nabla_\eta A(\eta)^\top - \nabla_\eta A(\eta)\nabla_\eta A(\eta)^\top + \nabla_\eta A(\eta) \nabla_\eta A(\eta)^\top &= \nabla^2 A(\eta)\\ \rightarrow \mathbb{E}[t(x)t(x)^\top] - \mathbb{E}[t(x)]^2 &= \nabla_\eta^2 A(\eta)\\ \rightarrow \mathrm{Var}[t(x)] &= \nabla_\eta^2 A(\eta) \end{align*} where we made use of the relation $\mathrm{Var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$.

##### Example: Bernoulli

We can write $$p(x | \theta) = \pi^x (1 - \pi)^{1 - x} = \mathrm{exp}\left((x \log\left(\frac{\pi}{1 - \pi}\right) + \log(1 - \pi)\right)$$ where $\eta(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$, $t(x) = x$ and $A(\eta) = \log(1 + e^\eta)$. We can verify the expectation and variance by using the relations above and computing $$\mathbb{E}[x] = \frac{\partial A(\eta)}{\partial \eta} = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}} = \pi$$ and $$\mathrm{Var}[x] = \frac{\partial^2 A(\eta)}{\partial \eta^2} = \left(\frac{e^\eta}{1 + e^\eta}\right)\left(1 - \frac{e^\eta}{1 + e^\eta}\right) = \pi(1 - \pi)$$

#### Conjugate prior

In general, if we have data $X \sim p(X | \theta)$ and $\theta \sim p(\theta)$ then we say that the prior $p(\theta)$ is conjugate when the posterior $p(\theta | x)$ is in the same family as $p(\theta)$. Any exponential family distribution with $p(X | \eta) = h(x) \mathrm{exp}(\eta^\top t(x) - A(\eta))$ has a conjugate prior on $\eta$ given by $$p(\eta | \alpha, \nu) = f(\alpha, \nu) \mathrm{exp}(\eta^\top \alpha - \nu A(\eta))$$ where $f(\alpha, \nu)$ is a normalizing coefficient. Applying Bayes' rule gives \begin{align*} p(\eta | X_{1:N}) &\propto p(X_{1:N} | \eta) p(\eta)\\ &\propto \left( \prod_{i = 1}^N \mathrm{exp}(\eta^\top t(X_i) - A(\eta)) \right) \mathrm{exp}(\eta^\top \alpha - \nu A(\eta))\\ &\propto \mathrm{exp}\left(\eta^\top \left(\alpha + \sum_{i = 1}^N t(X_i)\right) - (\nu + N)A(\eta)\right)\\ &= p(\eta | \alpha^\prime, \nu^\prime) \end{align*} where $$\alpha^\prime = \alpha + \sum_{i = 1}^N t(X_i)$$ and $v^\prime = \nu + N$. This gives a closed-form update for the parameters of the model given new data points.

##### Example: Bernoulli

From above, we have $\eta = \log\left(\frac{\pi}{1 - \pi}\right)$ and $A(\eta) = \log(1 + e^\eta)$ and so the conjugate prior is $p(\eta | \alpha, v) \propto \mathrm{exp}(\eta \alpha - v\log(1 + e^\eta))$. We then get that $$p(\pi | \alpha, v) = \mathrm{exp}\left(\alpha \log \left(\frac{\pi}{1 - \pi}\right) - v\log\left(\frac{1}{1 - \pi}\right)\right)$$ which is $\mathrm{Beta}(\alpha + 1, v - \alpha + 1)$.

### Distributions of sums of random variables

If $X \sim p(X)$ and $Y \sim p(Y)$ which are independent then what can we say about $Z = X + Y$? Define the characteristic function of a random variable as $$\mathbb{E}[e^{i t X}] = \int e^{i t X}p(x) d x$$ The characteristic function of $Z$ is then given by \begin{align*} \mathbb{E}[e^{itZ}] &= \mathbb{E}[e^{it(X + Y)}]\\ &= \mathbb{E}[e^{itX}e^{itY}]\\ &= \int \int p(X, Y)e^{itX} e^{itY} dx dy\\ &= \int \int p(X)p(Y)e^{itX} e^{itY} dx dy\\ &= \left(\int p(X) e^{itX} d x\right) \left(\int p(Y) e^{itY} dy\right)\\ &= \mathbb{E}[e^{itX}] \mathbb{E}[e^{itY}] \end{align*} So the characteristic function of $Z$ is just the product of the characteristic functions for $X$ and $Y$ (when they are independent).

##### Example: Poisson-Multinomial

Say we have $X_n \sim \mathrm{Poisson}(\lambda_n), n = 1, \ldots, N$ where $\mathrm{Poisson}(X_n | \lambda_n) = \frac{\lambda^{X_n} e^{-\lambda_n}}{X_n!}$ and $n \in \{0, 1, 2, \ldots \}$. Now, let $Y = \sum_n X_n$. What can we say about $p(X_1, \ldots, X_N | Y)$? Using Bayes' rule, we have \begin{align*} p(X_1, \ldots, X_N | Y) &= \frac{p(Y | X_{1 : N}) p(X_1, \ldots, X_N)}{p(Y)}\\ &= \frac{p(X_1, \ldots, X_N)}{p(Y)} \end{align*} because $p(Y | X_{1 : N}) = 1$ because $X_{1 : N}$ gives you all the information you need to determine the value of $Y$. By independence, we can say that $$\frac{p(X_1, \ldots, X_N)}{p(Y)} = \frac{\prod_{n = 1}^N p(X_n)}{p(Y)}$$ So, if $X \sim \mathrm{Poisson}(\lambda)$ then its characteristic function is \begin{align*} \mathbb{E}[e^{itX}] &= \sum_{x = 0}^\infty e^{itx} \frac{ \lambda^x e^{-\lambda}}{x!}\\ &= e^{-\lambda} \sum_{x = 0}^\infty \frac{(\lambda e^{it})^x}{x!}\\ &= e^{-\lambda}e^{-\lambda it}\\ &= e^{-\lambda(1 - e^{it})} \end{align*} Now, we have $Y = \sum_n X_n$ so \begin{align*} \mathbb{E}[e^{itY}] &= \prod_{n = 1}^N \mathbb{E}[e^{itX_n}]\\ &= e^{-\left(\sum_n \lambda_n\right)(1 - e^{it})} \end{align*} because of the uniqueness of the characteristic function, $Y \sim \mathrm{Poisson}\left(\sum_n \lambda_n\right)$.

With some manipulation, we can see that \begin{align*} P(X_1, \ldots, X_N | Y) &= \frac{\prod_{n = 1}^N P(X_n)}{P(Y)}\\ &= \frac{\mathrm{exp}\left(\sum_n \lambda_n\right) \prod_{n = 1}^N \frac{\lambda_n^{X_n}}{X_n!}}{\mathrm{exp}\left(-\sum_n \lambda_n\right) \frac{\left(\sum_n\lambda_n\right)^Y}{Y!}}\\ &= \frac{Y!}{X_1! \cdots X_N!} \prod_{n = 1}^N \left(\frac{\lambda_n}{\sum_n \lambda_n} \right)^{X_n}\\ \end{align*} which is a multinomial distribution. Probabilistically speaking, drawing from $Y$ and drawing from this multinomial distribution is the same.

## Regression

Given data $D = \{(x_i, y_i)\}_{i = 1}^N$ where $x_i \in \mathbb{R}^d$ and $y_i \in mathbb{R}$ the goal of regression is to model the relationship between the “input” $x$ and “output” $y$ so that given a new $x$ we can predict the corresponding response $y$.

### Linear Regression

For linear regression, we assume that the output is linearly related to the input, so that $$y_i \approx f(x_i) = \omega_0 + \sum_{k = 1}^d \omega_k x_{i, k}$$ This reduces out goal to finding $\omega = [\omega_0, \ldots \omega_d]$. For convenience, we construct the matrix $X = [1, x_1^\top; 1, x_2^\top; \ldots 1, x_N^\top]$ so that $f(X) = X\omega$.

#### Least Squares Solution

We can find a solution for $\omega$ which minimizes the squared Euclidian distance between $X\omega$ and $y$, giving $$\omega_{LS} = \mathrm{arg}\min_\omega \|y - X\omega\|^2 = \mathrm{arg}\min_\omega (y - X\omega)^\top(y - X\omega)$$ Differentiating and setting equal to zero yields the solution $\omega_{LS} = (X^\top X)^{-1}X^\top y$. From a probabilistic perspective, using least square is implicitly assuming that the error is Gaussian noise, so that $y_i \sim \mathcal{N}(x_i^\top \omega, \sigma^2)$ or in vector form $y \sim \mathcal{N}(X\omega, \sigma^2\mathbb{I})$. This gives the likelihood $$p(y | X, \omega, \sigma^2) = (2\pi \sigma^2)^{-\frac{N}{2}} \mathrm{exp}\left(-\frac{1}{2\sigma^2}\|y - X\omega\|^2\right)$$ Differentiating the log-likelihood and setting equal to zero gives the maximum likelihood solution $\omega_{ML}$, and as expected $\omega_{ML} = \omega_{LS}$. We can now compute the expected value of $\omega_{ML}$ by $\mathbb{E}[\omega_{ML}] = (X^\top X)^{-1}X^\top \mathbb{E}[y] = (X^\top X)^{-1}X^\top X\omega = \omega$ and the covariance \begin{align*} \mathrm{Var}[\omega_{ML}] &= \mathbb{E}[(\omega_{ML} - \mathbb{E}[\omega_{ML}])(\omega_{ML} - \mathbb{E}[\omega_{ML}])^\top]\\ &= \mathbb{E}[(X^\top X)^{-1} X^\top y - (X^\top X)^{-1}X^\top X\omega)((X^\top X)^{-1} X^\top y - (X^\top X)^{-1}X^\top X\omega)^\top]\\ &= (X^\top X)^{-1} X^\top \mathbb{E}[(y - X\omega)(y - X\omega)^\top]X(X^\top X)^{-1}\\ &= (X^\top X)^{-1} X^\top \sigma^2 \mathbb{I} X(X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1} X^\top X(X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1} \end{align*} The matrix $X$ can be expressed by its singular value decomposition $X = V\Lambda^{1/2}Q^\top$ where $V$ and $Q^\top$ are the left and right orthonormal singular vector matrices and $\Lambda^{1/2}$ is the diagonal matrix of singular values. When the dimensions of $X$ are highly correlated, some of its singular values (entries of $\Lambda^{1/2}$) will be very small. We can express \begin{align*} \mathrm{Var}[\omega_{ML}] &= \sigma^2 ((V\Lambda^{1/2}Q^\top)^\top V\Lambda^{1/2}Q^\top)^{-1}\\ \mathrm{Var}[\omega_{ML}] &= \sigma^2 (Q\Lambda^{1/2}V^\top V\Lambda^{1/2}Q^\top)^{-1}\\ \mathrm{Var}[\omega_{ML}] &= \sigma^2 (Q\Lambda Q^\top)^{-1}\\ \mathrm{Var}[\omega_{ML}] &= \sigma^2 Q\Lambda^{-1} Q^\top\\ \end{align*} Note that when $X$ has highly correlated dimensions, it will have singular values $\Lambda$ which decay rapidly when sorted by magnitude, and so the inversion $\Lambda^{-1}$ will cause $\mathrm{Var}[\omega_{ML}]$ to have some very large values.

#### Ridge Regression/MAP

One solution to the highly-correlated $X$ situation is to add an $L^2$ penalty to $w$, giving the ridge regression solution $$\omega_{RR} = \mathrm{arg}\min_{\omega} \frac{1}{2\sigma^2} \|y - X\omega\|^2 + \frac{\lambda}{2} \|\omega\|^2$$ As before, we can differentiate and set equal to zero to find the solution $$\omega_{RR} = (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1} X^\top y$$ This is the same as the maximum likelihood solution except that we have added the term $\lambda \sigma^2 \mathbb{I}$ which increases the eigenvalues of $X^\top X$ when inverting. This is equivalent to putting a prior on $\omega$ such that $\omega \sim \mathcal{N}(0, \lambda^{-1}\mathbb{I})$. So, we end up finding \begin{align*} \omega_{MAP} &= \mathrm{arg}\max_\omega \log(p(\omega | D))\\ &= \mathrm{arg}\max_\omega \log(p(D|\omega)p(\omega)) + \mathrm{const}\\ &= \mathrm{arg}\max_\omega -frac{1}{2\sigma^2} \|y - X\omega\|^2 - \frac{\lambda}{2} \|\omega\|^2 + \mathrm{const}\\ &= \omega_{RR} \end{align*} So finding the maximum a posteriori estimate with a Gaussian prior on $\omega$ is the same as performing ridge regression. As before, we can compute the expected value of the solution by $$\mathbb{E}[\omega_{MAP}] = (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1} X^\top \mathbb{E}[y] = (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1} X^\top X \omega$$ so that we essentially have a biased estimate of the maximum likelihood solution. Similarly, we can compute the variance, giving \begin{align*} \mathrm{Var}(\omega_{MAP}) &= \mathbb{E}[(\omega_{MAP} - \mathbb{E}[\omega_{MAP})(\omega_{MAP} - \mathbb{E}[\omega_{MAP}])^\top]\\ &= (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1} X^\top \mathbb{E}[(y - X\omega)(y - X\omega)^\top]X(\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1}\\ &= \sigma^2 (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1} X^\top X (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1}\\ &= \sigma^2 (\lambda \sigma^2 \mathbb{I} + Q\Lambda Q^\top)^{-1} X^\top X (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1}\\ &= \sigma^2 (Q(\lambda \sigma^2\mathbb{I})Q^\top + Q\Lambda Q^\top)^{-1} X^\top X (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1}\\ &= \sigma^2 (Q(\lambda \sigma \mathbb{I} + \Lambda)Q^\top)^{-1} X^\top X (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1}\\ &= \sigma^2 Q\tilde{\Lambda}^{-1}Q^\top \end{align*} where we have defined $\tilde{\Lambda}$ such that $$\tilde{\Lambda}^{-1}_{ii} = \frac{\Lambda_{ii}}{(\lambda \sigma^2 + \Lambda_{ii})^2}$$

If we have that $y \sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})$ and $\omega \sim \mathcal{N}(0, \lambda \mathbb{I})$ then $p(\omega | D) \propto p(y | X, \omega)p(\omega)$. If we have a Gaussian prior on the mean of a Gaussian, then the posterior is still Gaussian, so we can compute the posterior as $p(\omega | D) = \mathcal{N}(\omega | \mu, \Sigma)$ where $\mu = (\lambda \sigma \mathbb{I} + X^\top X)^{-1}X^\top y$ and $\Sigma = (\lambda \mathbb{I} - \frac{1}{\sigma^2} X^\top X)^{-1}$. This can be “read off” directly from our knowledge about the form of the posterior of a Gaussian with mean with a Gaussian prior.

When computing the approximation $\hat{y}_{new}$ from some $X_{new}$ which was not in $D$ using maximum likelihood, we compute $\hat{y}_{new} = x^\top_{new} \omega_{ML}$. On the other hand, with MAP or ridge-regression, we have $\hat{y}_{new} = x^\top_{new} \omega_{MAP}$. From a Bayesian standpoint, we have

$$p(y_{new} | x_{new}, X, y) = \int p(y_{new} | x_{new}, \omega)p(\omega | X, y)d \omega = \mathcal{N}(y_{new} | \mu^\top x_{new}, \sigma^2 + x_{new} \Sigma x_{new})$$

by marginalizing according to our assumptions on the model. This gives us a measure of uncertainty of our estimate of $y_{new}$. In this way, if we are presented with new data, we can decide whether to measure $y_{new}$ directly depending on its amount of uncertainty; if we do measure $y_{new}$ we can use it to update the parameters of our model. However, being confident in the model doesn't necessarily mean it's a good model.

#### Nonlinear regression

If we don't expect $y$ to be a linear function of $x$, we can still have a model which is linear in $\omega$ but can express this relationship by instead passing the data through some nonlinearity $\phi$ to compute $X = [1, \phi(x_1)^\top; 1, \phi(x_2)^\top; \ldots 1, \phi(x_N)^\top]$. We then have exactly the same learning problem, where we are trying to learn linear weights - the data matrix $X$ has just been modified.

## Classification

As with regression, we are given data $\{(x_i, y_i)\}_{i = 1}^N$ with measurements $x_i \in \mathbb{R}^d$ but instead we have $y_i \in \{0, 1\}$ which are discrete class labels. Given some $x_{new}$, we'd like to predict $y_{new}$. We can view this as a regression problem where we have latent variables (ones we don't see) $z \sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})$ and $\omega \sim \mathcal{N}(0, \lambda^{-1}\mathbb{I})$ where $z_i \ge 0 \rightarrow y_i = 1$ and $z_1 < 0 \rightarrow y_i = 0$

### Bayes classifier

We assume that $y_i \sim_{iid} \mathrm{Bern}(\pi)$ and that $x_i | y_i \sim \mathcal{N}(\mu_{y_i}, \sigma^2)$. So, we have $p(y = 0) = 1 - \pi$ and $p(y = 1) = \pi$. The resulting label $y$ selects which of the two Gaussians we draw from to construct $x_i$. To be Bayesian, we use a prior $\pi \sim \mathrm{Beta}(a, b)$ and $\mu_i \sim \mathcal{N}(0, \nu)$, which gives the posterior \begin{align*} P(\pi, \mu_0, \mu_1 | D) &\propto p(D | \pi, \mu_0, \mu_1) p(\pi, \mu_0, \mu_1)\\ &\propto p(X_{1:N} | y_{1:N}, \pi, \mu_0, \mu_1) p(y_{1:N} | \pi, \mu_0, \mu_1) p(\pi, \mu_0, \mu_1)\\ &\propto \left( \prod_{i = 1}^N p(x_i | \mu_{y_i}) \right)\left(\prod_{i = 1}^N p(y_i | \pi) \right) p(\pi) p(\mu_1) p(\mu_0)\\ &\propto \left(p(\mu_0) \prod_{i \in A_0} p(x_i | \mu_0) \right)\left(p(\mu_1) \prod_{i \in A_1} p(x_i | \mu_i) \right) \left(p(\pi)\prod_{i = 1}^N p(y_i | \pi)\right) \end{align*} where we made the substitution $A_\ell = \{i : y_i = \ell\}$. In this case, each of these terms is independent, so we can normalize separately to find posteriors. The posteriors $p(\mu_\ell | x_{i \in A_\ell})$ is a normal distribution, and $p(\pi | y_{1:N})$ is a Beta distribution.

Now, given a new $x_{N + 1}$, we'd like to predict $y_{N + 1}$. We can compute \begin{align*} p(y_{N + 1} = 1 | x_{N + 1}, D) &\propto p(x_{N + 1} | x_{i \in A_i}) p(y_{N + 1} = 1 | y_{1 : N})\\ &\propto \int_{-\infty}^\infty p(x_{N + 1} | \mu_1) p(\mu_1 | x_{i \in A_1}) d\mu_1 p(y_{N + 1} = 1 | y_{1 : N})\\ &\propto \int_{-\infty}^\infty p(x_{N + 1} | \mu_1) p(\mu_1 | x_{i \in A_1}) d\mu_1 \int_0^1 p(y_{N + 1} = 1 | \pi) p(\pi | y_{1:N})du \end{align*} Given these expressions, we can compute the marginal distributions analytically for $y_{N + 1} = 0$ and $y_{N + 1} = 1$ and normalize to get a posterior probability.

### Logistic Regression

As with regression, we say $x \in \mathbb{R}^{d + 1}$ where the first dimension is set equal to $1$ for the bias term. We'd like to learn a vector of coefficients $\omega \in \mathbb{R}^{d + 1}$ to do classification. In the Bayes classifier, we have a model for the model and the data combined, so it's a generative model. Regression models are discriminative because they do not attempt to model the data. For logistic regression, we say that $y \sim \mathrm{Bern}(\sigma(x^\top \omega))$ where $\sigma(x^\top \omega) = \frac{e^{x^\top \omega}}{1 + e^{x^\top \omega}}$. With this model, $p(y = 1 | X, \omega) = \sigma(x^\top \omega)$. The logistic function maps values less than 0 towards 0 and values greater than 0 towards 1. Computing $x^\top \omega$ will produce a negative number on one side of the vector normal to $\omega$ and a positive value on the other side. Now $$P(y_1, \ldots, y_N | x_1, \ldots, x_N, \omega) = \prod_{i = 1}^N \sigma(x_i^\top \omega)^{y_i} (1 - \sigma(x_i^\top \omega)^{1 - y_i}$$ We can define a prior on the coefficient vector so that $\omega \sim \mathcal{N}(0, \lambda^{-1} \mathbb{I})$. The posterior is given by $$p(\omega | D) = \frac{\left(\prod_{i = 1}^N \sigma_i^{y_i} (1 - \sigma_i)^{1 - y_i} \right) \mathrm{exp}(-\frac{\lambda}{2} \omega^\top \omega)}{\int \left(\prod_{i = 1}^N \sigma_i^{y_i} (1 - \sigma_i)^{1 - y_i} \right) \mathrm{exp}(-\frac{\lambda}{2} \omega^\top \omega) d\omega}$$ Note that we can't compute the normalizing denominator, so we need a different way of computing the posterior.

#### MAP inference

The MAP solution is given by \begin{align*} \omega_{MAP} &= \mathrm{arg}\max_\omega \log(p(\omega | D))\\ &= \mathrm{arg}\max_\omega \log(p(D | \omega)) + \log(p(\omega)) + \mathrm{const}\\ &= \mathrm{arg}\max_\omega \left(\sum_{i = 1}^N y_i \omega^\top x_i - \log(1 + e^{\omega^\top x_i}) \right) - \frac{\lambda}{2} \omega^\top \omega + \mathrm{const} \end{align*} We can iteratively find this maximum using gradient descent: First, initialize $\omega = \vec{0}$. Then, for each step $t$ set $\omega_t = \omega_{t - 1} - (\nabla_\omega^2 L)^{-1} \nabla_\omega L$ where $L$ is the loss function $$L = \sum_{i = 1}^N (y_i \omega^\top x_i - \log( 1 + \mathrm{exp}(\omega^\top x_i)) - \frac{\lambda}{2} \omega^\top \omega$$ and iteratively continue this process until $\omega$ doesn't change. From above, we can compute $$\nabla_\omega L = \sum_{i = 1}^N (y_i - \sigma_i) x_i - \lambda \omega_{t - 1}$$ and $$\nabla^2_\omega L = \left(\sum_{i = 1}^N \sigma_i(1 - \sigma_i) x_i x_i^\top\right) - \lambda\mathbb{I}$$

#### Laplace Approximation

If we actually want to approximate $p(\omega | D)$, we can use Laplace approximation, a completely general model for models with non-conjugate priors. The posterior distribution is $$p(\omega | D) \propto \mathrm{exp}(\log(p(D | \omega) p(\omega))) = \mathrm{exp}(\log(p(D, \omega))$$ Laplace approximation approximates this log joint-likelihood $p(\omega | D)$ with the second order Taylor expansion of a Gaussian $q(\omega | D) = \mathcal{N}(\omega | \omega_{MAP}, \Sigma)$ centered about the MAP, giving \begin{align*} q(\omega | D) &\propto \mathrm{exp}\left(\log(p(D, \omega_{MAP}) + (\omega - \omega_{MAP})^\top \nabla_\omega \log(p(D, \omega)|_{\omega_{MAP}} + \frac{1}{2} (\omega - \omega_{MAP})^\top \nabla_\omega^2 \log(p(D, \omega))|_{\omega_{MAP}} (\omega - \omega_{MAP})\right)\\ &\propto \mathrm{exp}\left(-\frac{1}{2} (\omega - \omega_{MAP})^\top (-\nabla^2 \log(p(D, \omega))|_{\omega_{MAP}}) (\omega - \omega_{MAP})\right)\\ &= \mathcal{N}(\mu, \Sigma) \end{align*} because the first term is constant and the gradient at the maximum $w_{MAP}$ is equal to $0$, and where $\mu = w_{MAP}$ and $$\Sigma = (-\nabla_\omega^2 \log(p(D, \omega))|_{\omega_{MAP}})^{-1}$$ In the specific case of the logistic regression problem, we have $$\Sigma = (\lambda I + \sum_{i = 1}^N \sigma_I(1 - \sigma_i)X_i X_i^\top )^{-1}$$

#### Multiclass Logistic Regression

In the multiclass setting, we instead have $y \in \{1, \ldots, K\}$. Now, each class $k$ has a corresponding vector $\omega_k$ , and instead of a sigmoid function we use the softmax function $$\sigma(x^\top \omega_k) = \frac{e^{x^\top \omega_k}}{\sum_j e^{x^\top \omega_j}}$$ So, the probability of the labels given the data and coefficients is \begin{align*} P(y_1, \ldots y_N | x_1, \ldots x_N, \omega_1 \ldots \omega_K) &= \prod_{i = 1}^N p(y_1 | x_i, \omega_{1:K})\\ &= \prod_{i = 1}^N\left(\prod_{k = 1}^K \left(\frac{e^{x_i^\top \omega_k}}{\sum_j e^{x_i^\top \omega_j}} \right)^{\mathbf{1}(y_i = k)} \right)\\ &= \prod_{i = 1}^N \mathrm{Mult}(y_i | \sigma_i) \end{align*} where $\mathbf{1}(y_i = k)$ is the indicator function, and we denote $$\sigma_i(k) = \left(\frac{e^{x_i^\top \omega_k}}{\sum_j e^{x_i^\top \omega_j}} \right)$$ Our prior is then $$p(\omega_k) = \mathcal{N}(0, \lambda^{-1}\mathbb{I}) = \frac{\lambda^{d/2}}{(2\pi)^{d/2}} \mathrm{exp}\left(\frac{-\lambda}{2} \omega_k^\top \omega_k \right)$$ which gives the log-posterior \begin{align*} \log(p(\omega_1, \ldots, \omega_k | D)) &= \log\left(\prod_{i = 1}^N p(y_i|\omega_{1:K}, x_i)\right) + \log\left(\prod_{k = 1}^K p(\omega_k) \right) + \mathrm{const}\\ &= \left(\sum_{i = 1}^N \sum_{k = 1}^K \mathbf{1}(y_i = k) \omega_k^\top x_i\right) - \sum_{i = 1}^N \log\left(\sum_{j = 1}^K e^{\omega_j^\top x_i}\right) - \sum_{k= 1}^K \frac{\lambda}{2} \omega_k^\top \omega_k + \mathrm{const} \end{align*} From this, we can find an appropriate $\omega$ with gradient descent.

## Hierarchical Models

In linear classification and regression, we have a single object $\omega$ which we we want a posterior distribution on. In Hierarchical models, there are many objects we want posterior distributions for. This generally means that there are many variables interacting (so the model is more complex than prior, likelihood, posterior). This often means that certain parameters are more than one “step” way from the data, so that priors may have their own priors.

### Matrix Completion

Say we are given a matrix $M \in \mathbb{R}^{N_1 \times N_2}$ which has many values “missing”, that is we only know the true value for about $5\%$ of its entries. This situation arises in collaborative filtering settings, where the rows correspond to users and the columns correspond to ratings. Each user has probably only rated a few things, and we want to be able to predict their rating for some product they have not rated yet based on other user's rating. We believe that this matrix is low-rank, and model it as $M \approx uv + \mathrm{noise}$ where $u \in \mathbb{R}^{N_1 \times d}$ with rows $u_i \in \mathbb{R}^d$ and $v \in \mathbb{R}^{d \times N_2}$ with columns $v_j \in \mathbb{R}^d$.

#### Probabilistic Matrix Factorization

We can attack this problem from a Bayesian perspective. Let $\Omega = \{(i, j) : M_{i, j} \mathrm{observed}\}$ be the set of indices of observed entries of $M$. In the probabilistic matrix factorization formulation, we say that $M_{i, j} \sim_{i.i.d.} \mathcal{N}(u_i^\top v_j, \sigma^2 \mathbb{I})$ for $(i, j) \in \Omega$. We use priors $u_i \sim_{i.i.d.} \mathcal{N}(0, \lambda^{-1}\mathbb{I})$ and $v_j \sim_{i.i.d.} \mathcal{N}(0, \lambda^{-1}\mathbb{I})$. In this formulation, there is no orthonormality constraint on the factorization. We also must pick $d$. Once the model is fit, we can predict a missing value using the learned $u_i$ and $v_j$. In contrast to regression/classification, we two separate variables $u$ and $v$ that we need to learn. Given the data, we need to learn our posterior, which we can write down as \begin{align*} P(u_{1:N_1}, v_{1:N_2} | D) &= \frac{p(D | u_1, \ldots, u_{N_1}, v_1, \ldots, v_{N_2})p(u_1, \ldots, u_{N_1}, v_1, \ldots, v_{N_2})}{p(D)}\\ &= \frac{\left(\prod_{(i, j) \in \Omega} p(M_{ij} | u_i, v_j)\right)\left(\prod_{i = 1}^{N_1} p(u_i) \right)\left(\prod_{j = 1}^{N_2} p(v_j)\right)}{p(D)} \end{align*} where we were able to separate out the products based on the fact that our model assumes that the observed entries in $M$ are independent given $u$ and $v$ and that $M_{ij}$ is conditionally independent of everything except $u_i$ and $v_j$ and that the priors on $u_i$ and $v_j$ are independent. Now, the normalizing constant $p(D)$ is $$p(D) = \int_{u_1} \cdots \int_{u_{N_1}} \int_{v_1} \cdots \int_{v_{N_2}} \left(\prod_{(i, j) \in \Omega} p(M_{ij} | u_i, v_j)\right)\left(\prod_{i = 1}^{N_1} p(u_i) \right)\left(\prod_{j = 1}^{N_2} p(v_j)\right) dv_{N_2} \cdots dv_1 du_{N_1} \cdots du_1$$ which is conditionally conjugate, but the model variables become coupled in the likelihood and so is not tractable to compute. However, we can use MAP inference by finding \begin{align*} u^{MAP}_{1:N_1}, v^{MAP}_{1:N_2} &= \mathrm{arg}\max_{u_{1:N_1}, v_{1:N_2}} \sum_{(i, j) \in \Omega} \log\left(p(M_{ij} | u_i, v_j)\right) + \sum_{i = 1}^{N_1} \log\left(p(u_i)\right) + \sum_{j = 1}^{N_2} \log\left(p(v_j)\right)\\ &= \mathrm{arg}\max_{u_{1:N_1}, v_{1:N_2}} \sum_{(i, j) \in \Omega} -\frac{1}{2\sigma^2} \left(M_{ij} - u_i^\top v_j\right)^2 + \sum_{i = 1}^{N_1} -\frac{\lambda}{2} u_i u_i^\top + \sum_{j = 1}^{N_2} -\frac{\lambda}{2} v_j^\top v_j + \mathrm{const}\\ &= \mathrm{arg}\max_{u_{1:N_1}, v_{1:N_2}} \mathcal{L} \end{align*} where in the last step we simply defined $\mathcal{L}$ for this particular problem.

#### Solving using Coordinate Ascent

We can solve an optimization problem like the one above with coordinate ascent, which is a simple algorithm which optimizes by each free variable individually until convergence as follows: First, initialize $u_{1:{N_1}}, v_{1:{N_2}}$ (e.g. randomly), then, until convergence, optimize $\mathcal{L}$ separately with respect to each $u_i$ and $v_j$ holding everything else fixed. The optimization step for one of the free variables (here, e.g., $u_i$) can be seen by differentiating and setting equal to zero by \begin{align*} \nabla_{u_i} \mathcal{L} &= \sum_{j : (i, j) \in \Omega} \frac{1}{\sigma^2} (M_{ij} - u_i^\top v_j)v_j - \lambda u_i = 0\\ &= \frac{1}{\sigma^2} \sum_{j : (i, j) \in \Omega} M_{ij} v_j - \frac{1}{\sigma^2} \left(\sum_{j : (i, j) \in \Omega} v_j v_j^\top\right)u_i - \lambda u_i = 0\\ \rightarrow u_i &= \left(\lambda \sigma^2 \mathbb{I} + \sum_{j : (i, j) \in \Omega} v_j v_j^\top \right)^{-1} \left(\sum_{j : (i, j) \in \Omega} M_{ij} v_j \right) \end{align*} Recall that in L2-regularized least squares setting (ridge regression) where $y \sim \mathcal{N}(X\omega, \sigma^2\mathbb{I})$ and $\omega \sim \mathcal{N}(0, \lambda^{-1}\mathbb{I})$ yielded the MAP solution $\omega_{MAP} = (\lambda \sigma^2 \mathbb{I} + X^\top X)^{-1}X^\top y$. By inspecting the form of $u_i$ above, and letting $S_{u_i} = \{j : (i, j) \in \Omega\}$, $\vec{m}_{u_i} = M_{i, S_{u_i}}$ and $V_i$ be a matrix whose rows are $v_{S_{u_i}}$ our model becomes $\vec{m}_{u_i} \sim \mathcal{N}(V_i, u_i, \sigma^2\mathbb{I}), u_i \sim \mathcal{N}(0, \lambda^{-1}\mathbb{I})$ which yields $u_i^{MAP} = (\lambda\sigma^2\mathbb{I} + V_i^\top V_i)^{-1} V_i^\top \vec{m}_{u_i}$ which is the same as what was derived above, except in matrix form.

So, at each iteration, we need to solve $N_1 + N_2$ ridge regression problems. The “design matrix” (corresponding to $X$ in $y \sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})$ is changing, we have to iterate several times until convergence. It would be nice to get the posterior distribution of $u_{1:N_1}$ and $v_{1:N_2}$ but they become coupled in the likelihood so we write $p(u_i | D)$ and $p(v_j | D)$ separately, that is, we can't write $$p(u_{1:N_1}, v_{1:N_2} | D) \ne \left(\prod_{i = 1}^{N_1} p(u_i | D) \right)\left(\prod_{j = 1}^{N_2} p(v_j | D)\right)$$ as we were able to in the prior. However, we do have “conditional conjugacy”, in that we can write $$p(u_i | D, u_{\not{i}}, v_{1:{N_2}}) \propto p(D | u_i, v_{1:{N_2}})p(u_i)$$ where $u_{\not{i}}$ denotes all $u_j$ such that $j \ne i$ and $p(D | u_i, v_{1:{N_2}})$ is an independent Gaussian and $p(u_i)$ is a Gaussian prior on “regression coefficents”. Using our matrix formulation described above, we can write $$p(u_i | D, V_i) = \mathcal{N}(u_i | \mu, \Sigma); \mu = (\lambda \sigma^2\mathbb{I} + V_i^\top V_i)^{-1} V_i^\top \vec{m}_{u_i}; \Sigma = \left(\lambda \mathbb{I} + \frac{1}{\sigma^2} V_i^\top V_i\right)^{-1}$$ It follows that conditioned on everything else, $u_i$ has a Gaussian prior, just as for the linear regression problem.

#### Solving using Gibbs Sampling

Obtaining a point-estimate solution is useful, but somewhat defeats the purpose of a Bayesian/probabilistic model which should capture uncertainty in all possible solutions. Sampling methods are another way to obtain an estimate of the solution, but rather than picking a specific point they sample from the posterior. The samples can then be used to estimate statistics on possible solutions. In a general sense, we want the posterior $p(\theta_1, \ldots, \theta_N | D)$ but we can't compute it. In Gibbs sampling, we instead calculate $p(\theta_i | \theta_{\not{i}}, D)$ and sample from it as follows: First, initialize $\theta_1^{(0)}, \ldots, \theta_N^{(0)}$, then for each step $t$ sample $\theta_i^{(t)} \sim p(\theta_i | \theta_1^{(t)}, \ldots, \theta_{i - 1}^{(t)}, \theta_{i + 1}^{(t - 1)}, \ldots, \theta_N^{(t - 1)}, D)$. After a “burn-in period”, e.g. $t = 1000$, we can start collecting samples of $\theta_1^{(t)}, \ldots, \theta_N^{(t)}$, skipping every e.g. 100 samples to account for correlations.

In the specific example of Gibbs sampling for the matrix factorization model, we begin by randomly initializing $u_{1:N_1}$ and $v_{1:N_2}$ and, for each iteration, construct $\vec{m}_{u_i}$ and $V_i$ using the most recent values then sample $u_i \sim \mathcal{N}(\mu, \Sigma)$ and set $\mu = (\lambda \sigma^2 \mathbb{I} + V_i^\top V_i)^{-1} V_i^\top \vec{m}_{u_i}$ and do the same thing for $v_j$ except that $\vec{m}_{v_j}$ is from a column $u_j$ and pulls from $u_1, \ldots, u_{N_1}$.

### Sparse Regression

Recall that in the regression setting, we have data $y \in \mathbb{R}^N$ and covariates $X \in \mathbb{R}^{N \times d}$. We assume $y$ is produced by the likelihood model $y \sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})$ with $\omega$ unknown, with prior $\omega \sim \mathcal{N}(0, A^{-1})$. In the case where $d \ll N$, we might believe that only a small subset of the dimensions of $X$ are important for predicting $y$. We'd like to be able to select these dimensions implicitly while estimating $\omega$. We do this using a “sparsity promoting prior”. If we say that $\omega_k \sim \mathcal{N}(0, \alpha_k^{-1}$ so that $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_d)$ then we can encourage sparity by encouraging large values for some of the $\alpha_k$ - in this way, the prior variance of $\omega_k$ approaches zero (as $\alpha_k \rightarrow \infty, \alpha_k^{-1} \rightarrow 0$), so that $\omega_k$ will be $0$ with high probability. A Gammma prior on $\alpha_k$ can be used to encourage this, so that our model is given by $$y \sim \mathcal{N}(X\omega, \sigma^2\mathbb{I}); \omega \sim \mathcal{N}(0, A^{-1}); \alpha_k \sim_{i.i.d.} \mathrm{Gamma}(a, b)$$ so that $$p(\alpha_k | a, b) = \frac{b^a}{\Gamma(a)} \alpha_k^{a - 1}e^{-b\alpha_k}$$ We can find the MAP parameter estimates by computing \begin{align*} P(\omega, \alpha_{1:d} | X, y) &\propto p(y | X, \omega, \alpha_{1:d})p(\omega, \alpha_{1:d})\\ &\propto p(y | X, \omega)p(\omega | \alpha_{1:d})p(\alpha_{1:d})\\ &\propto p(y | X, \omega)p(\omega | \alpha_{1:d}) \prod_{i = 1}^d p(\alpha_i)\\ \rightarrow \log\left(p(\omega, \alpha_{1:d} | X, y)\right) &= \log\left(p(y | X, \omega)\right) + \log\left(p(\omega | \alpha_{1:d})\right) + \sum_{i = 1}^d \log\left(p(\alpha_i)\right) + \mathrm{const}\\ &= -\frac{1}{2\sigma^2} (y - X\omega)^\top(y - X\omega) + \sum_{i = 1}^d \left( -\frac{\alpha_i}{2}\omega_i^2 + \frac{1}{2}\log(\alpha_i) \right) + \sum_{i = 1}^d \left((\alpha - 1)\log(\alpha_i) - b\alpha_i\right) + \mathrm{const}\\ &= \mathcal{L} \end{align*} which gives $$\omega_{MAP}, \alpha^{MAP}_{1:d} = \mathrm{arg}\max_{\omega, \alpha_{1:d}} \mathcal{L} = \mathrm{arg}\max_{\omega, \alpha_{1:d}} \log\left(p(\omega, \alpha_{1:d} | X, y)\right)$$ We can perform this optimization using MAP estimation. First, initialize $\omega$, e.g. to the minimum L2 solution $\omega = X^\top (X X^\top)^{-1}y$. Then, iteratively update $\omega$ by $$\nabla_\omega \mathcal{L} = 0 \rightarrow \omega = \frac{\left(A + \frac{1}{\sigma^2} X^\top X\right)^{-1} X^\top y}{\sigma^2}$$ and update $\alpha_i$ by $$\frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \rightarrow \alpha_i = \frac{2a - 1}{\omega_i^2 + 2b}$$ We can encourage $\alpha_i$ to grow to $\infty$ when $\omega_i \rightarrow 0$ by setting $b = 10^{-16}$ and $a = \frac{1}{2} + \epsilon$, where the $\epsilon$ parameter sets how fast the values grow.

Note that $\left(A + \sigma^{-2} X^\top X\right)^{-1}$ is the inversion of a $d \times d$ matrix, and in this setting $d$ is large while $N$ is small and $A$ is diagonal. The matrix inversion lemma is simply the relation $$(A + BC^{-1}D)^{-1} = A^{-1} - A^{-1}B(C + DA^{-1}B)^{-1}DA^{-1}$$ which allows us to write $$(A + \sigma^{-2}X^\top X)^{-1} = A^{-1} - A^{-1}X^\top(\sigma^2 \mathbb{I} + XA^{-1}X^\top)^{-1} XA^{-1}$$ which involves only the inversion of a diagonal matrix and a $N \times N$ matrix.

#### Type II Maximum Likelihood

Sometimes, we can learn better models by integrating out the uncertainty (prior), although this may make the model intractable. In the current example, our uncollapsed model is given by $y | \omega \sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})$ and $\omega | A \sim \mathcal{N}(0, A^{-1})$. We can collapse this model so that $p(y | A) = \int_{\omega} p(y | \omega) p(\omega | A)d\omega$ and $y | A \sim \mathcal{N}(0, \sigma^2 \mathbb{I} + XA^{-1}X^\top)$. In type II maximum likelihood, we use a Bayesian model on $\omega$ but do not use a prior on $A$; we integrate out $\omega$ and perform maximum likelihood on the implied prior on $\omega$ by computing $\alpha_{1:d}^{ML2} = \mathrm{arg}\max_{\alpha_{1:d}} \log(p(y | A))$. This gives a loss of $$\mathcal{L} = \log(p(y | a)) = -\frac{1}{2} \log\left(\left|\sigma^2 \mathbb{I} + XA^{-1}X^\top \right|\right) - \frac{1}{2} y^\top (\sigma^2 \mathbb{I} + XA^{-1}X^\top)^{-1}y + \mathrm{const}$$ If we let $C_{\not{i}} = \sigma^2\mathbb{I} + \sum_{j \ne i} \alpha_j^{-1} x_j x_j^\top$ where $x_j$ is the $j$th column of $X$, and using the determinant trick ($|C_{\not{i}} + \alpha_i^{-1} x_i x_i^\top| = |C_{\not{i}}|(1 + \alpha_i^{-1} x_i^\top C_{\not{i}}^{-1}x_i)$ and the matrix inversion lemma we can write $$\mathcal{L} = \frac{1}{2}\log(|C_{\not{i}}|) - \frac{1}{2} \log(1 + \alpha_i^{-1} x_i^\top C_{\not{i}}^{-1} x_i) - \frac{1}{2} y^\top C_{\not{i}}^{-1} y + \frac{1}{2} \left(\frac{(y^\top C_{\not{i}} x_i)^2}{\alpha_i + x_i^\top C_{\not{i}}^{-1}x_i}\right)$$ and we can compute our update by setting \begin{align*} \frac{\partial\mathcal{L}}{\partial \alpha_i} &= \frac{1}{2} \left(\frac{x_iC_{\not{i}}^{-1} x_i \alpha_i^{-1}}{\alpha_i + x_i^\top C_{\not{i}}^{-1} x_i}\right) - \frac{1}{2}\left(\frac{(y^\top C_{\not{i}}^{-1}x_i)^2}{\alpha_i + x_i^\top C_{\not{i}}^{-1} x_i}\right) = 0\\ \rightarrow \alpha_i = \frac{x_i^\top C_{\not{i}}^{-1} x_i}{(y^\top C_{\not{i}}^{-1}x_i)^2 + x_i^\top C_{\not{i}}^{-1} x_i} \end{align*} However, this is only a solution when it is positive (that is, $(y^\top C_{\not{i}}^{-1}x_i)^2 > x_i^\top C_{\not{i}}^{-1}x_i$) because $\alpha_i > 0$; otherwise, $\alpha_i \rightarrow \infty$ is always a solution. This produces the following algorithm: First, initialize $\alpha_i = 0$ for all $i$. Then, until convergence, for each $i \in \{1, \ldots, d\}$, construct $C_{\not{i}}$ and set $$\alpha_i = \frac{x_i^\top C_{\not{i}}^{-1} x_i}{(y^\top C_{\not{i}}^{-1}x_i)^2 + x_i^\top C_{\not{i}}^{-1} x_i}$$ when $(y^\top C_{\not{i}}^{-1}x_i)^2 > x_i^\top C_{\not{i}}^{-1}x_i$ and $\infty$ otherwise. The updates of $C_{\not{i}}$ can be rank-one, that is, numerically cheap. Once $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_d)$ is computed we can learn the posterior on $\omega$ and use it for prediction by \begin{align*} y &\sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})\\ \omega &\sim \mathcal{N}(0, A^{-1})\\ p(\omega | y, X, A) &\propto p(y | \omega, X)p(\omega | A)\\ p(\omega | y, X, A) &= \mathcal{N}(\omega | \mu, \Sigma)\\ \mu &= \frac{\left(A + \frac{1}{\sigma^2}X^\top X\right)^{-1} X^\top y}{\sigma^2}\\ \Sigma &= \left(A + \frac{1}{\sigma^2}X^\top X\right)^{-1}\\ p(y_{new} | X_{new} y, X, A) &= \int p(y_{new} | X_{new}, \omega)p(\omega | y, X, A)d\omega\\ &= \mathcal{N}(y_{new} | \mu_{new}, \sigma_{new}^2)\\ \mu_{new} &= X_{new}^\top\mu\\ \sigma_{new}^2 &= \sigma^2 + X_{new}^\top \sigma X_{new}\\ \end{align*}

## Clustering

In the clustering setting we are given data $\{x_1, \ldots, x_N\}$ where $x_i \in \mathbb{R}^d$ without labels. We'd like to partition them into $K$ groups so that points in the same group are “similar” by some measure and points in different groups are “different”. If we measure similarity based on distance, then points within the same group will be close in $\mathbb{R}^d$ and points in different groups should be far apart.

### K-means

K-means is a non-probabilistic clustering model which shares many characteristics with probability-based ones. The K-means algorithm starts with an implicit assumption of there being $K$ clusters in the input data, each of which is characterized by its centroid, denoted $\mu_1, \ldots, \mu_K$ where $\mu_i \in \mathbb{R}^d$. Each datapoint is then assigned to the cluster whose centroid it is closest to. The K-means algorithm learns the location of the centroids, and therefore performs the clustering.

Let $c_n \in \mathbb{R}^{N \times K}$ be the cluster assignment vector so that $$c_{nj} = \begin{cases} 1,& x_n \in \mathrm{cluster \;} j\\ 0,& \mathrm{otherwise}\\ \end{cases}$$ We can formulate the “cost” of some clustering by computing the sum of the distances between each point and the cluster it is assigned to by $$J = \sum_{n = 1}^N \sum_{j = 1}^K c_{nj} \|x_n - \mu_j\|^2$$

#### Algorithm

Minimizing this objective leads naturally to an algorithm which iteratively alternates between updating $c_n$ for $n = 1, \ldots, N$ and then $\mu_j$ for $j = 1, \ldots, k$. More specifically, at each iteration, we hold $\mu_1, \ldots, \mu_K$ fixed and for each $n \in \{1, \ldots, N\}$ set $c_{nj} = 1$ when $j = \mathrm{arg}\min_i \|x_n - \mu_i\|^2$ and $0$ otherwise. Then, for each $j \in \{1, \ldots, K\}$ we hold $c_{nj}$ fixed and minimize $\sum_n c_{nj}\|x_n - \mu_j\|^2$ which we can differentiate and set equal to zero to obtain $$\mu_j = \frac{\sum_n c_{nj}x_n}{\sum_n c_{nj}}$$ which is just the mean of the data points in cluster $j$.

### Gaussian Mixture Models

GMMs are a probabilistic clustering scheme where the density of $\{x_1, \ldots, x_N\}, x_i \in \mathbb{R}^d$ is modeled by a mixture of $K$ Gaussians so that $$p(x | \mu_{1 : K}, \sigma_{1 : K}, \pi) = \sum_{j = 1}^K \pi_j \mathcal{N}(x, \mu_j, \sigma_j)$$ where $\pi \in \mathbb{R}^K$ is the probability vector so that $\pi_j < 0$ and $\sum_j \pi_j = 1$ and $\mu_j$ and $\Sigma_j$ are the mean and covariance for cluster $j$ respectively. If the data is assumed to be conditionally independent, the joint likelihood is given by $$p(x_1, \ldots, x_N | \pi, \mu, \Sigma) = \prod_{n = 1}^N p(x_n | \pi, \mu, \Sigma) = \prod_{n = 1}^N \left(\sum_{j = 1}^K \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)\right)$$ which gives a log-likelihood $$\log(p(x_{1 : N} | \pi, \mu, \Sigma)) = \sum_n \log \left( \sum_j \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)\right)$$ which is intractable due to the summation within the logarithm. However, we can make this quantity much easier to work with by introducing hidden data/latent variables to the model.

#### Latent Variables

We can add variables $c_n$ to the GMM model so that we keep the desired marginals. We want it to be the case that $$\sum_{c_n} p(x_n, c_n | \mu, \Sigma, \pi) = p(x_n | \mu, \Sigma, \pi) = \sum_{j = 1}^K \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)$$ This leads to a hierarchical process where we sample $x_n | c_n \sim \mathcal{N}(\mu_{c_n}, \Sigma_{c_n})$ (where $\mu_{c_n}$ and $\Sigma_{c_n}$ denote (by abuse of notation) the mean and variance of the Gaussian which $x_n$ is assigned to) then sample $c_n \sim_{iid} \mathrm{Mult}(\pi)$. Summing over the likelihood gives $$\sum_{c_n} p(x_n, c_n | \mu, \Sigma, \pi) = \sum_{c_n} p(x_n | c_n, \mu, \Sigma) p(c_n | \pi) = \sum_{j = 1}^K \mathcal{N}(x_n | \mu_j, \Sigma_j)\pi_j$$ If we keep $c_n$ as part of the model, then the joint likelihood becomes \begin{align*} p(x_{1:N}, c_{1:N} | \mu, \Sigma, \pi) &= \prod_{n = 1}^N p(x_n, c_n | \mu, \Sigma, \pi)\\ &= \prod_{n = 1}^N p(x_n | c_n, \mu, \Sigma)p(c_n | \pi)\\ &= \prod_{n = 1}^N \prod_{j = 1}^K \mathcal{N}(x_n | \mu_j, \Sigma_j)^{c_{nj}} \pi_j^{c_{nj}} \end{align*} Note that there is no longer a summation, so our log-likelihood will be tractable. We can compute the log-likelihood by $$\log(p(x_{1:N}, c_{1:N} | \mu, \Sigma, \pi)) = \sum_{n = 1}^N \sum_{j = 1}^K c_{nj} \left(\log(\pi_j) - \frac{1}{2}(x_n - \mu_j)^\top \Sigma_j^{-1} (x_n - \mu_j) - \frac{1}{2}\log(|\Sigma_j|)\right) + \mathrm{const}$$

#### Maximum Likelihood Algorithm

As with K-means, we perform an iterative alternating procedure, first holding $\mu, \sigma, \pi$ fixed and maximizing over $c_n$ and then holding $c_n$ fixed and fitting $\pi, \mu, \Sigma$. Specifically, at each iteration, we first set $c_{nj} = 1$ when $j = \mathrm{arg}\max_i \left(\log(\pi_j) + \log(\mathcal{N}(x_n | \mu_j, \Sigma_j))\right)$ and $0$ otherwise for each $n \in \{1, \ldots, N\}$. Then, based on computing partial derivatives of the likelihood and setting equal to zero, we hold $c_{nj}$ fixed and set $$\mu_j = \frac{\sum_n c_{nj}x_n}{\sum_n c_{nj}}$$ (the mean of the data in each cluster), $$\Sigma_j = \frac{\sum_n c_{nj} (x_n - \mu_j)(x_n - \mu_j)^\top}{\sum_n c_{nj}}$$ (the covariance of the data in each cluster), and, using Lagrange multipliers to enforce $\sum_j \pi_j = 1$, $$\pi_j = \frac{\sum_n c_{nj}}{N}$$ (the empirical distribution of the assignments). Note that we are using maximum likelihood because $\mu, \Sigma$ and $\pi$ don't have priors. As a result, $\Sigma_j$ will be full rank of $\sum_n c_{nj} < d$, which is undesirable but could be handled with a prior on $\Sigma_j$. Furthermore, we can interpret the $c_n$ update step by noting that \begin{align*} \mathrm{arg}\max_j \left( \log(\pi_j) + \mathcal{N}(x_n | \mu_j, \Sigma_j)\right) &= \mathrm{arg}\max_j \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)\\ &= \mathrm{arg}\max_j \frac{\pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}{\sum_i \pi_i \mathcal{N}(x_n | \mu_i, \Sigma_i)} \end{align*} which is the MAP solution for $c_n$ because \begin{align*} p(c_{nj} = 1 | x_n, \pi, \mu, \Sigma) &\propto p(x_n | c_{nj} = 1, \mu, \Sigma)p(c_{nj} = 1 | \pi)\\ &\propto \mathcal{N}(x_n | \mu_j, \Sigma_j)\pi_j \end{align*} This suggests that we may be able to compute a posterior for $c_{nj}$ rather than doing a hard assignment.

### The EM Algorithm

Taking the GMM as an example, we want to learn the maximum likelihood solution to $$p(x_{1:N} | \theta) = \prod_{n = 1}^N p(x_n | \theta)$$ by maximizing $\sum_n \log(p(x_n | \theta))$. If we introduce the latent class assignment variables $c_n$ to obtain the joint distribution $p(x_n, c_n | \theta)$ we can marginalize to obtain an alternate form for $p(x_n | \theta)$ given by $$p(x_n | \theta) = \sum_{c_n} p(x_n, c_n | \theta)$$ We can therefore equivalently maximize $$\sum_{n = 1}^N \log\left(\sum_{c_n} p(x_n, c_n | \theta) \right)$$ which can be made tractable by assuming conditional conjugacy. That is, given the observed data $x_n$ and parameter setting $\theta^{(old)}$, we can compute a posterior distribution for the hidden data $p(c_n | x_n, \theta)$ analytically. In this way, we perform the optimization in stages, which is the core principle of the EM algorithm, which we can formulate as follows: First, given the current value of $\theta$ (called $\theta^{(old)}$), we calculate the posterior $p(c_n | x_n, \theta^{(old}))$. Second, we construct the objective function \begin{align*} Q(\theta, \theta^{(old)}) &= \sum_{n = 1}^N \sum_{c_n} p(c_n | x_n, \theta^{(old)})\log(p(x_n, c_n | \theta))\\ &= \sum_{n = 1}^N \mathbb{E}(\log(p(x_n, c_n | \theta)) | \theta^{(old)}) \end{align*} Finally, update $\theta$ by optimizing $Q$ by setting $\theta^{(old)} = \mathrm{arg}\max_\theta Q(\theta, \theta^{(old)})$. This process is repeated until convergence. Constructing the objective function is called the E (expectation) step, and maximizing $Q$ is called the M (maximization) step. For MAP estimation on $\theta$, we can add $\log(p(\theta))$ to $Q(\theta, \theta^{(old)})$ and then optimize. Generally, the algorithm is considered to have converged when the value of $\theta^{(old)}$ doesn't change much in the M step.

##### Example: GMM

In the GMM setting, $\theta = \{\mu_{1:K}, \Sigma_{1:K}, \pi\}$. Let $\gamma_{nj}$ be the posterior of $c_{nj}$ given by $$\gamma_{nj} = p(c_{nj} = 1 | x_n, \mu, \Sigma, \pi) = \frac{\pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}{\sum_{i = 1}^K \pi_i \mathcal{N}(x_n | \mu_i, \Sigma_i)}$$ Now, from above we have $p(x_n, c_{nj} = 1 | \mu, \Sigma, \pi) = \log(\pi_j) + \log(\mathcal{N}(x_n | \mu_j, \Sigma_j))$ so in the E-step we set \begin{align*} Q(\theta, \theta^{(old)}) &= \sum_{n = 1}^N \sum_{j = 1}^K p(c_{nj} = 1 | x_n, \mu, \Sigma, \pi) \log(p(x_n, c_{nj} | \mu, \Sigma, \pi))\\ &= \sum_{n = 1}^N \sum_{j = 1}^K \gamma_{nj} \left(\log(\pi_j) - \frac{1}{2}(x_n - \mu_j)^\top \Sigma_j^{-1} (x_n - \mu_j) - \frac{1}{2} \log(|\Sigma_j|)\right) \end{align*} and in the M-step we compute the partial derivatives of $Q(\theta, \theta^{(old)})$ with respect to each parameter to set \begin{align*} \mu_j &= \frac{\sum_n \gamma_{nj}x_n}{\sum_n \gamma_{nj}}\\ \Sigma_j &= \frac{\sum_n \gamma_{nj} (x_n - \mu_j)(x_n - \mu_j)^\top}{\sum_n \gamma_{nj}}\\ \pi_j &= \frac{\sum_n \gamma_{nj}}{N} \end{align*} (using Lagrange multipliers as before). Note that this formulation is the same as above except that we have “soft” assignment.

#### Theoretical Motivation

For simplicity, assume we only have one observation $x$ so that we seek to maximize $\log(p(x | \theta))$ with respect to $\theta$. Given some other model variable $c$, we can marginalize to relate $p(c | x, \theta)p(x | theta) = p(x, c | \theta)$ so that $\log(p(x | theta)) = \log(p(x, c | \theta)) - \log(p(c | x, \theta))$. Now, we introduce some distribution $q(c)$ on $c$ whose parameters are hidden and whose form is unknown. We can now compute \begin{align*} &\log(p(x | \theta)) = \log(p(x, c | \theta)) - \log(p(c | x, \theta))\\ \rightarrow & q(c)\log(p(x | \theta)) = q(c)\log(p(x, c | \theta)) - q(c)\log(p(c | x, \theta))\\ \rightarrow & \sum_c q(c)\log(p(x | \theta)) = \sum_c q(c)\log(p(x, c | \theta)) - \sum_cq(c)\log(p(c | x, \theta))\\ \rightarrow & \log(p(x | \theta)) = \sum_c q(c)\log(p(x, c | \theta)) - \sum_c q(c)\log(p(c | x, \theta))\\ \rightarrow & \log(p(x | \theta)) = \sum_c q(c)\log(p(x, c | \theta)) - \sum_c q(c)\log(q(c)) + \sum_c q(c)\log(q(c)) - \sum_c q(c)\log(p(c | x, \theta))\\ \rightarrow & \log(p(x | \theta)) = \sum_c q(c)\log\left(\frac{p(x, c | \theta)}{q(c)}\right) + \sum_c q(c)\log\left(\frac{q(c)}{p(c | x, \theta)}\right) \end{align*} The term $$\sum_c q(c)\log\left(\frac{p(x, c | \theta)}{q(c)}\right)$$ is a loss function $\mathcal{L}(q, \theta)$, and the term $$\sum_c q(c)\log\left(\frac{q(c)}{p(c | x, \theta)}\right)$$ is the KL divergence between $q$ and $p$ (which is zero when $q = p$). So, the EM algorithm is a method for maximizing $\log(p(x | \theta))$ by working instead with $\mathcal{L}$ and the KL divergence in a two-step process. In the E step, we set $q(c) = p(c | x, \theta)$ which implicitly sets $\mathrm{KL}(q || p) = 0$ and leaves $\log(p(x | \theta))$ unchanged because $\theta$ is fixed to $\theta^{(old)}$ in the E step. This increases $\mathcal{L}(q, \theta)$ to $\log(p(x | \theta))$. In the M-step, we maximize $\mathcal{L}(q(c | x, \theta^{(old)}))$ with respect to $\theta$, which is equivalent to setting $$\theta^{(new)} = \mathrm{arg}\max_\theta Q(\theta, \theta^{(old)}) = \mathrm{arg}\max_\theta \sum_c p(c | x, \theta^{(old)}) \log(p(x, c | \theta))$$ This also increases $\mathcal{L}$. Furthermore, $q(c)$ (which is defined as $p(c | x, \theta^{(old)})$) is no longer equal to $p(c | x, \theta^{(new)})$, so $\mathrm{KL}(q || p)$ has been increased too (because the KL divergence is always positive and is only 0 when $p = q$). Our new value $\theta^{(new)}$ therefore satisfies $\log(p(x | \theta^{(new)})) > \log(p(x | \theta^{(old)}))$. Now that $q(c) \ne p(c | x, \theta^{(new)})$, we can return to the E-step.

##### Example: Sparse Regression

Recall that in the sparse regression setting we have data $X \in \mathbb{R}^{N \times d}$ with labels $y \in \mathbb{R}^N$, and we model $y \sim \mathcal{N}(X\omega, \sigma^2 \mathbb{I})$ and $\omega_k \sim_{iid} \mathcal{N}(0, \alpha_k^{-1})$ for $k \in \{1, \ldots, d\}$. We can estimate $$\alpha_{1:d} = \mathrm{arg}\max_{\alpha_{1 : d}} \log(p(y | X, \alpha_{1:d}))$$ using the EM algorithm. First, note that we have $$p(y | X, \alpha_{1 : d}) = \int_\omega p(y, \omega | X, \alpha_{1:d}) d\omega$$ so intuitively we let $\omega$ be the latent variables in this model. In the E-step, we calculate $p(\omega | y, X, \alpha_{1 : d}^{(old)}) = \mathcal{N}(\mu, \Sigma)$ by setting $$\mu = \left(A + \frac{1}{\sigma^2} X^\top X\right)^{-1} \frac{X^\top y}{\sigma^2}$$ and $$\Sigma = \left(A + \frac{1}{\sigma^2} X^\top X\right)^{-1}$$ Then, we construct ????????????????? \begin{align*} Q(\alpha_{1:d}, \alpha_{1:d}^{(old)}) &= \int p(\omega | y, X, \alpha_{1:d}^{(old)}) \log(p(y, \omega | x, \alpha_{1:d}))d\omega\\ &= -\frac{1}{2} \left(y^\top y - 2y^\top X \mathbb{E}[\omega] + \mathbb{E}[\omega^\top X^\top X \omega]\right) - \sum_{k = 1}^d \frac{\alpha_k}{2} \mathbb{E}[\omega_k^2] + \frac{1}{2}\log(\alpha_k) + \mathrm{const} \end{align*} And in the M-step, we set $$\frac{\partial Q(\alpha_{1:d}, \alpha_{1:d}^{(old)})}{\partial \alpha_k} = 0$$ Setting equal to zero gives $$\frac{\partial Q}{\partial \alpha_k} = -\frac{1}{2} \mathbb{E}[\omega_k^2] + \frac{1}{2\alpha_k} = 0 \rightarrow \alpha_k = \frac{1}{\mathbb{E}[\omega_k^2]} = \frac{1}{\mu_k^2 + \Sigma_{k, k}}$$