# http://colinraffel.com/wiki/


# On Weighted Gaussian Integration of Feature Bags

## Problem Setting

In the bag-of-features setting, we are given a set of matrices (called “bags”) whose columns are feature vectors. Given a collection of labeled bags, we hope to be able to recover labels for unlabeled bags. More formally, we denote $A_k \in \Re^{m \times \eta_k}$ as the $k$th of $n$ bags in our dataset, consisting of $\eta_k$ feature vectors each of size $m$. Note that we do not require each bag to hold the same number of feature vectors - the matrices $A_k$ are not necessarily of uniform size. Each $A_k$ has a corresponding label $y_k \in \{1, 2, \ldots, p\}$. Given a query feature bag $B$ with unknown label, we would like to assign it a label in $\{1, 2, \ldots, p\}$ based on our training data.

A straightforward approach would be to train a classifier to assign each column of each $A_k$ the corresponding label $y_k$ (such that each $A_k$ is just treated as $\eta_k$ independent feature vectors, each with label $y_k$). Then, given a feature bag $B$ with unknown label, we simply predict labels for each column of $B$. Finally, we can use a “voting” formulation to assign $B$ the label which was most commonly predicted for its feature vectors. While straightforward, this approach ignores the fact that there may be some relationship between the individual feature vectors of a bag. In addition, it is unnecessarily computationally complex.
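As a concrete sketch of this voting baseline (with a hypothetical nearest-centroid rule standing in for an arbitrary per-vector classifier; NumPy assumed):

```python
import numpy as np

def vote_label(bags, labels, B):
    """Voting baseline: classify each column of B independently, then
    assign B the majority label. A nearest-centroid rule stands in for
    an arbitrary per-vector classifier."""
    classes = sorted(set(labels))
    # One centroid per class: the mean over all feature vectors pooled
    # from every bag carrying that label.
    centroids = np.stack([
        np.concatenate([A for A, y in zip(bags, labels) if y == c],
                       axis=1).mean(axis=1)
        for c in classes])
    # Squared distance from each column of B to each class centroid.
    dists = ((B.T[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    votes = np.argmin(dists, axis=1)          # per-column predictions
    return classes[np.bincount(votes).argmax()]
```

Note that labeling each of the $\eta_k$ columns separately is exactly what makes this baseline costly relative to classifying a single integrated vector per bag.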

## Bag Integration

We may be able to get better results with lower computational complexity by only classifying a single “integrated” feature vector for each bag. In other words, we seek a mapping $\phi: \Re^{m \times \eta_k} \rightarrow \Re^d$ where $\phi(A_k)$ is the integrated feature vector to be used in classification. Note that because our bags are not of uniform size, $\phi$ will need to handle matrices with an arbitrary number of columns.

### Gaussian Bag Integration

A common choice for $\phi$ is $\phi_\mu: \Re^{m \times \eta_k} \rightarrow \Re^m$, which computes the mean of the feature bag by

$$\phi_\mu(A_k) = A_k \frac{\vec{1}_{\eta_k}}{\eta_k}$$

where $\vec{1}_{\eta_k} \in \Re^{\eta_k}$ is a vector of all 1s. Note that $\phi_\mu$ will produce a feature vector of uniform size regardless of the value of $\eta_k$. This approach is simple but has its shortcomings. First of all, it assumes that all feature vectors are of equal importance for classification - it may be that some feature vectors are noisy or irrelevant. In addition, it ignores the ordering of the feature vectors, which may be informative for classification. Finally, it assumes that the distribution of feature vectors is well characterized by its mean.
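In NumPy terms, $\phi_\mu$ is just the column mean; a minimal sketch:

```python
import numpy as np

def phi_mu(A):
    """Mean integration: collapse an m x eta_k bag A into its m-dim column mean."""
    eta = A.shape[1]
    return A @ (np.ones(eta) / eta)   # identical to A.mean(axis=1)
```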

A simple way to address the independence assumptions of $\phi_{\mu}$ is to compute the full covariance matrix of each bag, giving

$$\phi_{\Sigma}(A_k) = \frac{U_k U_k^\top}{\eta_k - 1}$$

where $U_k$ is the bag $A_k$ with its row-wise mean subtracted, given by

$$U_k = A_k - \phi_\mu( A_k )\vec{1}_{\eta_k}^\top$$

In practice, the covariance matrix is flattened and stacked with the mean vector. Because the covariance matrix is symmetric, typically only the upper or lower triangle is used, giving the integration function $\phi_{\mu, \Sigma} : \Re^{m \times \eta_k} \rightarrow \Re^{m(m+3)/2}$. This transformation also assumes all feature vectors are equally important and that their order is irrelevant. An equally common transformation is $\phi_{\mu, \sigma^2}: \Re^{m \times \eta_k} \rightarrow \Re^{2m}$ which is defined as the concatenation of $\phi_\mu(A_k)$ and the diagonal of the covariance matrix,

$$\phi_{\sigma^2}(A_k) = \mathrm{diag}\left[\frac{U_k U_k^\top}{\eta_k - 1}\right]$$

In descriptive terms, $\phi_{\mu, \sigma^2}$ just computes the mean and variance of the feature vectors in a bag and concatenates them, thereby assuming that the feature vectors are vectors of independent, Gaussian-distributed random variables.
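A sketch of $\phi_{\mu, \sigma^2}$ and $\phi_{\mu, \Sigma}$ (NumPy assumed; the helper names are ours):

```python
import numpy as np

def phi_mu_sigma2(A):
    """Concatenate the bag mean with the per-feature variance (the
    diagonal of the covariance), giving a vector of size 2m."""
    eta = A.shape[1]
    U = A - A.mean(axis=1, keepdims=True)   # row-wise mean subtracted
    cov = (U @ U.T) / (eta - 1)
    return np.concatenate([A.mean(axis=1), np.diag(cov)])

def phi_mu_Sigma(A):
    """Concatenate the bag mean with the upper triangle of the
    covariance, giving a vector of size m(m + 3)/2."""
    eta = A.shape[1]
    U = A - A.mean(axis=1, keepdims=True)
    cov = (U @ U.T) / (eta - 1)
    return np.concatenate([A.mean(axis=1),
                           cov[np.triu_indices(A.shape[0])]])
```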

### Weighted Gaussian Bag Integration

An intuitive way to deal with the fact that Gaussian bag integration treats all feature vectors as equally important would be to instead calculate a weighted average and variance. More specifically, we can compute

$$\hat{\phi}_\mu(A_k) = A_k\omega_k$$

where $\omega_k \in \Re^{\eta_k}$ is a vector of weights for the feature vectors in bag $A_k$. Here, the dimensionality of the weighting vector must match the number of feature vectors in a given bag. This technique can similarly be applied to the covariance and variance integration functions as

$$\hat{\phi}_{\Sigma}(A_k) = \frac{\eta_k}{\eta_k - 1}(\vec{1}_{m} \omega_k^\top \circ \hat{U}_k) \hat{U}_k^\top$$

$$\hat{\phi}_{\sigma^2}(A_k) = \frac{\eta_k}{\eta_k - 1}\mathrm{diag}\left[(\vec{1}_{m} \omega_k^\top \circ \hat{U}_k) \hat{U}_k^\top\right]$$

where

$$\hat{U}_k = A_k - \hat{\phi}_\mu( A_k )\vec{1}_{\eta_k}^\top$$

and $\circ$ denotes the Hadamard product. Note that setting $\omega_k = \vec{1}_{\eta_k}/\eta_k$ for all $k$ makes $\hat{\phi}_\mu$, $\hat{\phi}_\Sigma$, and $\hat{\phi}_{\sigma^2}$ equivalent to $\phi_\mu$, $\phi_\Sigma$, and $\phi_{\sigma^2}$ respectively.
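A sketch of the weighted mean and variance (NumPy assumed). Broadcasting `U * w` implements $\vec{1}_m\omega_k^\top \circ \hat{U}_k$, and we include an $\eta_k/(\eta_k - 1)$ bias-correction factor, chosen so that uniform weights reduce exactly to the unweighted estimators:

```python
import numpy as np

def weighted_phi(A, w):
    """Weighted mean and variance integration of a bag A (m x eta)
    with per-vector weights w (length eta)."""
    eta = A.shape[1]
    mu = A @ w                              # weighted mean
    U = A - mu[:, None]                     # weighted-mean-subtracted bag
    # (1_m w^T o U) U^T via broadcasting, with a bias-correction factor
    # so that w = 1/eta recovers the unweighted (ddof=1) estimators.
    cov = (eta / (eta - 1)) * (U * w) @ U.T
    return mu, np.diag(cov)
```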

From this definition, a straightforward approach would be to compute a separate weighting vector for every bag. Because the number of feature vectors varies from bag to bag, we instead define a function $\Omega: \Re^{m \times \eta_k} \rightarrow \Re^{\eta_k}$ which, given a feature bag, produces a weight for each of its feature vectors. We can then compute $\omega_k = \Omega(A_k)$. Intuitively, a good $\Omega$ computes a weight for each feature vector, based on the feature vectors themselves, which represents how relatively “useful” that feature vector is for classification.

We note here that the proposed weighted Gaussian integration approach does not address the assumption that the feature values are Gaussian distributed. However, the general approach of assigning a weight to feature vectors according to their usefulness in discriminating between classes could be carried over in a straightforward way to other model assumptions. For the remainder of this work we will assume that the feature values are well characterized by their mean and variance in each bag.

## Finding $\Omega$

Based on the above discussion, we would ideally like to find an $\Omega$ which produces a weighting vector for each bag $A_k$ such that the classification accuracy of our bags using the integration function $\hat{\phi}_{\mu, \sigma^2}$ is maximized. One possibility would be to learn $\Omega$ using some form of regression. This technique would require a set of training bags for which we know a priori the optimal weighting, so that we could learn a mapping from the feature vectors to their optimal weight values. Realistically, however, the optimal weighting is not known and would need to be estimated. This approach would also be limited by the quality of the regression technique.

Based on these requirements, a logical approach would be to simultaneously estimate the optimal weights of a training set and the function $\Omega$ used to generate them. In other words, we would like to maximize the classification accuracy over the set of all possible functions $\Omega$. Clearly this optimization would be infeasible without constraints on the form of $\Omega$, so we constrain the possible functions $\Omega$ to be linear functions of each feature vector.

More specifically, we will focus on functions of the form

$$\Omega(A_k) = \frac{A_k^\top q + c\vec{1}_{\eta_k}}{\eta_k}$$

where $q \in \Re^m, c \in \Re$. Importantly, neither parameter depends on $\eta_k$. A function of this form can be thought of as choosing a linear combination of features to generate a value which denotes how useful the feature vector is for classification. The term $c\vec{1}_{\eta_k}$ is a bias term so that $\Omega$ reduces to the unweighted case when $q$ is a vector of all zeros and $c = 1$.
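This linear form of $\Omega$ is a one-liner; a sketch (NumPy assumed):

```python
import numpy as np

def make_omega(q, c):
    """Return the linear weighting function Omega parameterized by
    q (an m-vector) and the scalar bias c."""
    def omega(A):
        eta = A.shape[1]
        return (A.T @ q + c * np.ones(eta)) / eta
    return omega
```

With $q = 0$ and $c = 1$ this returns the uniform weights $\vec{1}_{\eta_k}/\eta_k$, recovering plain mean integration.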

### Clustering Heuristic

To estimate the parameters $q$ and $c$, we first propose a clustering heuristic motivated by Fisher's Linear Discriminant Analysis (LDA). This formulation relies on the assumption that decreasing in-class distance and increasing out-of-class distance is beneficial for discrimination accuracy. More concretely, we seek $q$ and $c$ such that the distance between mean-integrated feature vectors is minimized for bags with the same label and maximized for bags with different labels. Denoting $C$ as the set of all unique pairs of integers chosen from $1, \ldots, n$ without replacement, where order doesn't matter ($|C| = \binom{n}{2}$), we define

$$S = \{(i, j) \in C \;|\; y_i = y_j\}$$

The distance between two integrated feature bags $A_i$ and $A_j$ can be written as

$$D(A_i, A_j) = ||A_i\Omega(A_i) - A_j\Omega(A_j)||_2^2$$

so our objective is to minimize this distance for $(i, j) \in S$ and maximize it for $(i, j) \in S^\mathrm{C}$. Substituting the linear form of $\Omega$ described above, we have

$$D(A_i, A_j) = ||A_i\left(\frac{A_i^\top q + c\vec{1}_{\eta_i}}{\eta_i}\right) - A_j\left(\frac{A_j^\top q + c\vec{1}_{\eta_j}}{\eta_j}\right)||_2^2$$

If we define

$$\bar{A}_k = \frac{[A_kA_k^\top \,|\, A_k\vec{1}_{\eta_k}]}{\eta_k} \in \Re^{m \times (m + 1)}$$

(the horizontal concatenation of the matrix $A_kA_k^\top$ and the vector $A_k\vec{1}_{\eta_k}$) and $r = [q; c] \in \Re^{m+1}$ (the vertical concatenation of the vector $q$ and the scalar $c$) then we have

$$D(A_i, A_j) = ||\bar{A}_ir - \bar{A}_jr||_2^2 = ||(\bar{A}_i - \bar{A}_j)r||_2^2 = r^\top (\bar{A}_i - \bar{A}_j)^\top (\bar{A}_i - \bar{A}_j) r$$

A straightforward way to maximize the out-of-class distance and minimize the in-class distance would then be to solve

$$\mathrm{arg}\max_{r} \frac{r^\top \left(\sum_{(i, j) \in S^\mathrm{C}}(\bar{A}_i - \bar{A}_j)^\top (\bar{A}_i - \bar{A}_j)\right) r}{r^\top \left(\sum_{(i, j) \in S}(\bar{A}_i - \bar{A}_j)^\top (\bar{A}_i - \bar{A}_j)\right) r}$$

This is a generalized eigenvalue problem, with the optimal $r$ given by the eigenvector corresponding to the largest generalized eigenvalue. This provides a straightforward way to estimate the parameters $q$ and $c$.
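The whole heuristic fits in a few lines. The sketch below (NumPy assumed; `abar` and `fit_r` are our names) builds $\bar{A}_k$ for each bag, accumulates the two scatter matrices, and takes the top generalized eigenvector; the small ridge term is an assumption of ours to keep the within-class scatter invertible on tiny training sets:

```python
import numpy as np
from itertools import combinations

def abar(A):
    """Abar_k = [A A^T | A 1] / eta_k, so that weighted mean integration
    under the linear Omega is simply Abar_k @ r with r = [q; c]."""
    eta = A.shape[1]
    return np.hstack([A @ A.T, (A @ np.ones(eta))[:, None]]) / eta

def fit_r(bags, labels):
    """Clustering heuristic: r maximizes the ratio of between-class to
    within-class scatter of the differences Abar_i - Abar_j."""
    d = bags[0].shape[0] + 1
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for i, j in combinations(range(len(bags)), 2):
        diff = abar(bags[i]) - abar(bags[j])
        if labels[i] == labels[j]:
            Sw += diff.T @ diff        # pair in S: within-class scatter
        else:
            Sb += diff.T @ diff        # pair in S^C: between-class scatter
    # Generalized eigenproblem Sb r = lambda Sw r, with a small ridge
    # so Sw is invertible even for very small training sets.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-8 * np.eye(d), Sb))
    r = np.real(vecs[:, np.argmax(np.real(vals))])
    return r / np.linalg.norm(r)       # q = r[:-1], c = r[-1]
```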

An important note is that this simplification to a generalized eigenvalue problem does not hold for the weighted covariance and variance calculations. For this reason, we propose estimating the parameters $q$ and $c$ for the weighted mean case and using the same values for the weighted variance and covariance calculations.

An additional caveat is that the solution to the generalized eigenvalue problem will not necessarily be nonnegative or sum to one (although it will satisfy $r^\top r = 1$). Furthermore, there is no guarantee that the weighting vector $\omega_k$ for each bag will be nonnegative or sum to one. This is counterintuitive, but the generalized eigenvalue formulation will provide an optimal solution regardless of its interpretability.

### Simultaneous Estimation

While the clustering heuristic described above is intuitive, it does not directly optimize $q$ and $c$ to maximize accuracy for a specific classifier. For this reason, we explore the simultaneous optimization of a classifier model and these parameters. For simplicity, we will focus on the linear support vector machine (SVM).

Given a set of input vectors $x_i, i \in \{1, \ldots, n\}$ and labels $y_i \in \{\pm 1\}$, the support vector machine algorithm seeks a hyperplane of the form $wx_i - b$ such that $||w||$ is minimized and $y_i(wx_i - b) \ge 1$. This optimization produces a $w$ and $b$ which describe a hyperplane with maximal margin between the vectors labeled $1$ and those labeled $-1$.

In the weighted mean setting where the weighting function $\Omega$ is linear, the input vectors are given by

$$x_i = A_i\left(\frac{A_i^\top q + c\vec{1}_{\eta_i}}{\eta_i}\right)$$

which, using the substitutions described above, can be written as

$$x_i = \bar{A}_i r$$

If we simultaneously optimize $w, b, q, c$, the SVM problem becomes

$$\mathrm{arg}\min_{w, b, r} ||w|| \mathrm{\;s.t.\;} y_i(w\bar{A}_i r - b) \ge 1$$
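This joint problem is bilinear in $w$ and $r$, so it is nonconvex and has no closed-form solution. One simple way to attack it (a sketch of ours, not a prescribed algorithm) is subgradient descent on the soft-margin hinge relaxation of the constraints; it finds only a local optimum:

```python
import numpy as np

def joint_svm(abars, y, steps=1000, lr=0.01, C=1.0, seed=0):
    """Jointly fit (w, b, r) by subgradient descent on
    0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T Abar_i r - b)),
    the soft-margin relaxation of the constrained problem above."""
    rng = np.random.default_rng(seed)
    m, d = abars[0].shape                 # Abar_i is m x (m + 1)
    w = rng.normal(size=m) * 0.1
    r = rng.normal(size=d) * 0.1
    b = 0.0
    for _ in range(steps):
        gw, gb, gr = w.copy(), 0.0, np.zeros(d)
        for Abar, yi in zip(abars, y):
            if yi * (w @ (Abar @ r) - b) < 1:   # hinge is active
                gw -= C * yi * (Abar @ r)
                gb += C * yi
                gr -= C * yi * (Abar.T @ w)
        w -= lr * gw
        b -= lr * gb
        r -= lr * gr
    return w, b, r
```

Alternating between a standard linear SVM solve for $(w, b)$ with $r$ fixed and a solve for $r$ with $(w, b)$ fixed is another natural strategy, since each subproblem on its own is convex.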