# http://colinraffel.com/wiki/

### Site Tools

eecs_e6894_course_notes

## Theory

### Deep Networks

A deep network is a cascade of nonlinear functions which is computed recursively. That it, it can be viewed as a sequence of layers where the output of one layer is computed based on the output of the previous layer. If $o_n$ is the output of the $n$th layer, then $o_n = L_n(L_{n - 1}(\cdots(L_1(x)))$ (forward propagation). To actually utilize this network for some task, we first define a loss function which compares the output of the network $o_n$ to some target value $y$. To optimize the network for a task, the gradient of the loss function is computed with respect to the network output. By the chain rule, computing the gradient of the loss function with respect to the output can be done as $$\frac{\partial C}{\partial o_1} = \frac{\partial C}{\partial o_n} \frac{\partial o_n}{\partial o_{n - 1}} \cdots \frac{\partial o_2}{\partial o_1}$$ (backward propagation).

If we suppose each layer has the form $o_l = L_l(x) = f_l(w^\top x + b)$ where $f_l$ is some nonlinearity, $w$ is a matrix, and $b$ is a vector. We can then compute, using the chain rule, $$\frac{\partial C}{\partial w} = \sum_i \frac{\partial C}{\partial o_l} f^\prime _l x_i$$ …

Now that we have computed the gradient of the loss function with respect to the parameters, we can use it to adjust them so as to lower the loss function by setting $$w \leftarrow w - \alpha \frac{\partial C}{\partial w}, b \leftarrow b - \alpha \frac{\partial C}{\partial b}$$ This is gradient descent, where $\alpha$ is the learning rate hyperparameter. These updates are done for each parameter in each layer; here we are just showing one layer's parameter updates.

### Optimizing Loss Functions

Recently, larger datasets and more computational power have become available. In order to take advantage of this, we need a method which can learn from huge amounts of data efficiently. In a typical machine learning problem, we are trying to minimize a cost $C$ (a function of our input data $x_i$ and output data $y_i, i \in i, \ldots, N$ in our training set) with respect to some parameter $w$, or equivalently $$\min_w \frac{1}{N} \sum_{i = 1}^N C(x_i, y_i | w)$$ Given some testing set $T$ of examples outside of the training set, we hope that minimizing the above cost also minimizes $$\min_w \frac{1}{|T|} \sum_{j \in T} C(x_j, y_j | w)$$ If $C$ is convex, differentiable, and continuous we can try to use gradient descent $$w \leftarrow \alpha \frac{1}{N} \sum_{i = 1}^N \frac{\partial C(x_i, y_i | w)}{\partial W}$$ By updating our parameter $w$ in this way, we can ensure that we are minimizing the loss function as described above. In practice, the choice of $\alpha$ can prevent convergence if it's too large, and if it's too small convergence can be very slow. An alternative is Newton's method, where we use the inverse of the Hessian $\Gamma_t$ to update $$W_{t + 1} = w_t - \Gamma_t \frac{1}{N} \sum_{i = 1}^N \nabla_w C(x_i, y_i | w)$$ However, the inverse of the Hessian can be extremely expensive to compute for large training sets. A different alternative is to use coordinate descent, where instead of updating our parameter with a vector, we just move on the axis of greatest decrease. There are a number of other variance for this kind of optimization.

A difficulty with the above techniques is that they all depend on the entire data set; i.e. at every parameter update we have to compute a sum which has one term for each entry in the data set. Even worse, we may not be able to load the entire dataset into memory. As an alternative, instead of summing over all examples, we can randomly choose one example and use it to update the parameter, by $$w \leftarrow w - \alpha_t \frac{\partial C(x_t, y_t | w)}{\partial w}$$ We are effectively using the gradient with respect to a single example as an approximation of the sum of the gradients with respect to all samples. Stochastic gradient descent will always converge if $$\sum_t \alpha_t^2 < \infty, \sum_t \alpha_t = \infty$$ However, in practice we just choose some small $\alpha_t$ which decreases over time. Because of its simplicity, it's much easier to implement than most other optimization techniques. Because there's no sum over all training examples, one iteration of SGD is much faster than one iteration of gradient descent. However, it is not guaranteed to find the global minimum, but for some problems more data with a good local minimum is acceptably close to a global minimum. This is especially true when the cost function is not convex or when the training and testing sets have different distributions. In the case of deep networks, the update is computed based on the forward and backward pass over a single example. This mainly boils down computationally to matrix multiplications, so SGD on deep networks is particularly well suited to GPUs.

### Logistic Regression

Logistic regression can be viewed as a single layer neural network with a negative log-likelihood cost function. It is often used as the last layer of a neural network. Its output can be computed by $$p(x) = Pr(y = 1 | x) = \frac{\exp(\beta^\top x)}{1 + \exp(\beta^\top x)}$$ with cost function $$L(\beta) = \sum_{i = 1}^N C(x_i, y_i) = \sum_{i = 1}^N (y_i \log(p(x)) + (1 - y_i) \log(1 - p(x))$$ It turns out that optimizing this cost function with respect to $\beta$ is convex. Typically, an additional dimension is added to the input features which is always $1$ so that computing $\beta^\top$ includes and implicit bias term. Using gradient descent, we can derive that $$\frac{\partial L(\beta)}{\partial \beta} = X^\top (y - p)$$ where $X$ is the concatenation of all training vectors, and using gradient descent therefore sets $$\beta = \beta - \alpha \frac{\partial L(\beta)}{\partial \beta}$$ where $\alpha$ is the learning rate. Alternatively, Newton's method utilizes the Hessian $H = \nabla^2 L(\beta) = -X^\top WX$ (where $W$ is a diagonal matrix with elements $p(x)(1 - p(x))$) to use the update $$\beta = \beta - H^{-1} \frac{\partial L(\beta)}{\partial \beta}$$ Newton's method will converge faster, but a lot of computation is required to compute $H^{-1}$ particularly when the training set is large. We can also use stochastic gradient descent, which is equivalent to gradient descent except the gradient is only computed with respect to a single random example. Newton's method will always find the globally best parameter setting of $\beta$; gradient descent will provided that the learning rate is set appropriately, and SGD won't. SGD will usually converge on a solution having seen fewer examples than gradient descent.

### Multi-Layer Perceptron

A perceptron is a function which sets $$f(x) = \begin{cases} 1 & w^\top x + b > 0\\ 0 & \mathrm{otherwise} \end{cases}$$ A single perceptron can't solve all problems (e.g. XOR). However, if we cascade them to form a “multi-layer perceptron”, they are universal function approximators. Currently, we often use the formulation $f(x) = \phi(w^\top x + b)$ where $\phi$ is some nonlinearity. Some of these nonlinearities are meant to approximate the thresholding used in the perceptron; others aren't. Optimizing a cost function with respect to the layer parameters is nonconvex, so are not guaranteed to find a global minimum no matter what optimizer we use. So, in practice, SGD is usually used. For better performance, the weight matrices are usually initialized randomly, small batches are used instead of single examples, and the learning rate is selected using multiple training trials.

### Convolutional Networks

In a convolutional network, instead of computing a linear mix $w^\top x$ at each layer, a convolution is computed with a set of filters. After filtering, the result is “pooled”, i.e. aggregated over local neighborhoods. The filtering and pooling builds invariance to deformations of the input. The use of many such layers often builds a hierarchical representation which can be useful.