# http://colinraffel.com/wiki/

### Site Tools

expectation_backpropagation

# Expectation Backpropagation

 Authors Daniel Soudry, Itay Hubara, Ron Meir Publication info Retrieved here Retrieval date 10/06/14

Proposes a method for training neural networks with an expectation propagation-based approach which approximates the posterior of the weights given the data using a mean-field factorized distribution.

## Motivation/notation

Backpropagation isn't applicable to neural nets with binary weights, which can be implemented quickly in hardware. Bayesian inference can be used to train a single layer neural network with binary weights by approximating the input to each neuron as a Gaussian random variable; the present work extends this idea to multi-layer networks. The resulting training algorithm is parameter-free, insensitive to input magnitude, and more efficient than Monte Carlo methods.

This work focuses on binary MNNs, where $L$ is the number of layers, $V_l$ is the width of layer $l$, $\mathcal{W} = \{W_l \in \mathbb{R}^{V_l \times V_{l - 1}}\}_{l = 1}^L$ is the set of weight matrices where $\mathcal{W}$ is constrained to some set $\mathcal{S}$ such that $W_{ij,l} \in S_{ij, l}$, $\{v_l\}_{l = 0}^L$ are the outputs of each layer ($v_0$ is the net input and $v_L$ is the output) computed by $v_l = \mathrm{sign}(W_lv_{l - 1})$, and $g(v_0, \mathcal{W})$ is the composite network function. For simplicity, the layer fan-in $K_l = |\{j | S_{ij,l} \ne \{0\}\}|$ is constant for all $i$. Given a dataset $D_N = \{x^{(n)}, y^{(n)}\}_{n = 1}^N$ (note that $D_0 = \{\}$) of pairs of data points $x^{(n)}$ and labels $y^{(n)} \in \mathcal{Y} \subset \{-1, +1\}^{V_L}$, we assume that for all $n$ the relation $x^{(n)} \rightarrow y^{(n)}$ can be represented by an MNN from the hypothesis class (some known architecture); that is there exists some $\mathcal{W} \in \mathcal{S}$ so that $y^{(n)} = g(x^{(n)}, \mathcal{W}^\ast)$. We seek to approximate $W^\ast$ and estimate the most probably $y$ given some $x$.

## Bayesian approximation to learning

We assume some prior $P(\mathcal{W}|D_0)$ on the weights and seek the posterior $P(\mathcal{W}|D_N)$, from which we can get the MAP estimate $\mathcal{W}^\ast = \mathrm{argmax}_{\mathcal{W} \in \mathcal{S}} P(\mathcal{W} | D_N)$ by minimizing the expected zero-one loss over the weights to obtain approximate labels by $y = g(x, \mathcal{W}^\ast)$. Alternatively, we can minimize the expected loss over the dataset: $$y^\ast = \mathrm{argmax}_{y \in \mathcal{Y}} \sum_{\mathcal{W}} \mathcal{I} \{g(x, \mathcal{W}) = y\}P(\mathcal{W} | D_N)$$ which effectively finds the $y$ which minimizes the average zero-one loss $\mathcal{I}\{y^\ast \ne g(x, \mathcal{W})\}$ of all possible networks. In an online setting, we update the posterior by $P(\mathcal{W}|D_n) \propto P(y^{(n)} | x^{(n)}, \mathcal{W}) P(\mathcal{W}|D_{n - 1})$, where the per-data-point likelihood is simply $P(y^{(n)} | x^{(n)}, \mathcal{W}) = \mathcal{I}\{g(x^{(n)}, \mathcal{W}) = y^{(n)}\}$. Note that when $g(x^{(k)}, \mathcal{W}^0) \ne y^{(k)}$, the update sets $P(\mathcal{W}^0, D_{k + 1}) = 0$. There are an exponential number of terms in the update $P(\mathcal{W}|D_n)$, so an approximation is needed.

First, instead of storing $P(\mathcal{W}|D_n)$, we will store the factorized mean field approximation $$\hat{P}(\mathcal{W} | D_n) = \prod_{i, j, l} \hat{P} (W_{ij, l} | D_n)$$ where each factor must be normalized. In this approximation we can find the MAP estimate of the weights by $W^\ast_{ij, l} = \mathrm{argmax}_{W_{ij, l} \in S_{ij, l}} \hat{P}(W_{ij, l} | D_N)$. The update becomes $\hat{P}(W_{ij, l} | D_n) \propto \hat{P}(y^{(n)} | x^{(n)}, W_{ij, l}, D_{n - 1}) \hat{P}(W_{ij, l} | D_{n - 1})$ where the marginal likelihood term is computed by $$\hat{P}(y^{(n)} | x^{(n)}, W_{ij, l}, D_{n - 1}) = \sum_{\mathcal{W}^\prime : W^\prime_{ij, l} = W_{ij, l}} P(y^{(n)} | x^{(n)}, \mathcal{W}^\prime) \prod_{\{k, r, m\} \ne \{i, j, l\}} \hat{P} (W^\prime_{kr, m} | D_{n - 1})$$ However, the sum over $\mathcal{W}^\prime : W^\prime_{ij, l} = W_{ij, l}$ is intractable. For simplicity, we rewrite the above equation as $$P(y|W_{ij, l}) = \sum_{\mathcal{W}^\prime : W^\prime_{ij, l} = W_{ij, l}} P(y | \mathcal{W}^\prime) \prod_{\{k, r, m\} \ne \{i, j, l\}} P(W^\prime_{kr, m})$$ The term $P(y | \mathcal{W}^\prime)$ is just an indicator of the network output; we can compute it layer by layer by defining $$P(v_m | v_{m - 1}) = \sum_{W^\prime_m} \prod_{k = 1}^{V_m} \left[\mathcal{I}\left\{ v_{k, m} \sum_{r = 1}^{V_m - 1} v_{r, m - 1} W^\prime_{kr, m} > 0 \right\} \prod_{r = 1}^{V_m - 1} P(W^\prime_{kr, m})\right]$$ We can then recursively compute, for a fixed $W_{ij, l}$ \begin{align*} P(v_1) &= P(v_1 | v_0 = x)\\ \forall m \in \{2, \ldots, l - 1\} : P(v_m) &= \sum_{v_{m - 1}} P(v_m | v_{m - 1}) P(v_{m - 1})\\ P(v_l | W_{ij, l}) &= \sum_{v_{l - 1}} P(v_l | v_{l - 1}, W_{ij, l}) P(v_{l - 1}\\ \forall m \in \{l + 1, l + 2, \ldots, L\} : P(v_m | W_{ij, l}) &= \sum_{v_{m - 1}} P(v_m | v_{m - 1})P(v_{m - 1} | W_{ij, l}) \end{align*} These summations are also intractable, so we need to simplify them. In the mean-field approximation, $W_{ij, l}$ are independent and in the limit of infinite fan-in the normalized input to each neuronal layer can be modeled as a Gaussian: $\forall m : u_m = W_m v_{m - 1} / \sqrt{K_m} \sim \mathcal{N}(\mu_m, \Sigma_m)$. This allows us to calculate the distribution for $u_m$ and $v_m$ for any given $v_0$ and $W_{ij, l}$, which gives $P(y | W_{ij, l}$ which we can use to compute the Bayes update of the weights and outputs. The computational complexity is lower when $\Sigma_m$ is diagonal; otherwise, the distribution of $u_m$ is approximated using its mean-field version. Computing $P(v_L = y | W_{ij, l})$ for every $i, j, l$ is wasteful; instead, $\log(P(v_L = y|W_{ij, l}))$ is approximated using a Taylor expansion of $W_{ij, l}$ around its mean $\langle W_{ij, l} \rangle$, giving $\Delta_{k, m} = \partial \log(P(v_L = y))/\partial \mu_{k, m}$.

## The Expectation Backpropagation Algorithm

The EBP algorithm proceeds as follows: First, given input $x$ and desired $y$, compute the mean output $\langle v_l \rangle$ for each layer by setting $\langle v_{k, 0} \rangle = x_k$ and recursively computing \begin{align*} \mu_{k, m} &= \frac{1}{\sqrt{K_m}} \sum_{r = 1}^{V_{m - 1}} \langle W_{kr, m} \rangle \langle v_{r, m - 1} \rangle ; \langle v_{k, m} \rangle = 2\Phi(\mu_{k, m}/\sigma_{k, m}) - 1\\ \sigma^2_{k, m} &= \frac{1}{K_m} \sum_{r = 1}^{V_{m - 1}} \langle W^2_{kr, m} \rangle (\delta_{m, 1} (\langle v_{r, m - 1}\rangle^2 - 1) + 1) - \langle W_{kr, m} \rangle^2 \langle v_{r, m - 1} \rangle^2 \end{align*} where $\delta_{ij} = \mathcal{I}(i = j)$ and $\Phi(x) = \int_{-\infty}^x \mathcal{N}(u | 0, 1)du$. Second, update $P(W_{ij, l} | D_n)$ for all weights by first initializing $$\Delta_{i, L} = y_i \frac{\mathcal{N}(0 | \mu_{i, L}, \sigma^2_{i, L})}{\Phi(y_i\mu_{i, L}/\sigma_{i, L})}$$ then computing \begin{align*} \Delta_{i, l - 1} &= \frac{2}{\sqrt{K_l}} \mathcal{N}(0 | \mu_{i, l - 1}, \sigma^2_{i, l - 1}) \sum_{j = 1}^{V_m} \langle W_{ji, l} \rangle \Delta_{j, l}\\ \log(P(W_{ij, l} | D_n)) &= \log(P(W_{ij, l} | D_{n - 1})) + \frac{1}{\sqrt{K_l}} W_{ij, l} \Delta_{i, l} \langle v_{j, l - 1} \rangle + C \end{align*} where $C$ is a constant. Finally, to compute the output, we can either use the MAP estimate $W_{ij, l}^\ast = \mathrm{argmax}_{W_{ij, l} \in S_{ij, l}} \log(P(W_{ij, l} | D_n))$ to compute $g(x, W^\ast)$ (the “deterministic output”), or we can compute the MAP output (the “probabilistic output”) directly by $$y^\ast = \mathrm{argmax}_{y \in \mathcal{Y}} \log(P(v_L = y)) = \mathrm{argmax}_{y \in \mathcal{Y}} \left[\sum_k \log\left(\frac{1 + \langle v_k, L \rangle}{1 - \langle v_k, L \rangle} \right)^{y_k} \right]$$ Using the EBP algorithm to train binary networks using the probablistic output yielded the best performance on a variety of text classification tasks.