Accessing Minimal-Impact Personal Audio Archives

Authors Daniel P.W. Ellis and Keansub Lee
Retrieval date 9/13/12

Discusses the automatic segmentation and labeling of personal audio archives.

Personal Audio

Human memory is impressive but far more error-prone than digital storage. If we record the audio of our everyday lives, a system that allows automatic content-based retrieval becomes essential, because much of the audio will be neither interesting nor useful. The recent availability of cheap, small devices that can make day-length audio recordings has made personal audio archives possible, but manually reviewing and segmenting a day-length recording would be nearly impossible. This prompts the development of tools and techniques to automatically segment and classify such audio.

Potential Uses

Audio recording could be useful because it is robust to sensor position and orientation and can complement other sensors (for example, video) by making certain information more accessible. Audio can provide clues to location (based on ambient sounds), activity (typing on a computer, having a conversation, walking), people present (speaker identification), and words spoken.


The idea of passively storing experiences was proposed decades ago and was first realized using cameras. A more recent experiment classified and segmented personal audio using features from simultaneous video, and a good deal of work already exists on classifying, clustering, and segmenting audio based on its acoustic features.

Automatic Segmentation and Clustering

When a series of day-length recordings is being generated, segmenting the audio into short (say, five-minute) segments and clustering similar segments can help track down happenings of a specific category. To create these segments, frame-length acoustic features were calculated and their statistics summarized over longer frames of up to two minutes. An STFT with a frame length of 25 ms and a hop size of 10 ms was taken and summarized by a number of features. One feature was the linear frequency spectrum, in which the squared magnitudes of consecutive bins are weighted and summed into 21 output bins. The auditory spectrum is calculated similarly, except that the input bins collected into each output bin follow an approximation (Bark bands) of how we perceive frequencies. MFCCs were also calculated. A new feature, the spectral entropy, was calculated in each subband as <math>H[n, j] = -\sum_{k} \frac{w_{jk}X[n, k]}{A[n, j]}\log\left(\frac{w_{jk}X[n,k]}{A[n,j]}\right)</math> where the <math>A[n,j]</math> are the linear frequency spectrum values that normalize each band so that the weighted bin values <math>w_{jk}X[n,k]</math> within band <math>j</math> behave like a probability density function. The mean and standard deviation of each feature, before and after conversion to dB, were taken over the longer frames, along with the mean entropy and the entropy standard deviation normalized by its mean.
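The feature pipeline above can be sketched in NumPy. This is a minimal illustration, not the authors' code: it uses uniform (rectangular) band weights in place of the <math>w_{jk}</math>, linearly spaced bands rather than true Bark bands, omits the MFCCs, and the function name `summarize` and the `long_s` pooling parameter are made up for the example.

```python
import numpy as np

def summarize(x, sr, long_s=60.0, n_bands=21, frame_ms=25, hop_ms=10):
    """Short-time band energies and entropies, pooled over long frames."""
    n, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    win = np.hanning(n)
    X = np.abs(np.fft.rfft([x[i:i + n] * win
                            for i in range(0, len(x) - n + 1, hop)], axis=1)) ** 2
    # Collapse FFT bins into n_bands subbands. Linear spacing here;
    # Bark-spaced edges would give the paper's "auditory spectrum".
    edges = np.linspace(0, X.shape[1], n_bands + 1).astype(int)
    A = np.stack([X[:, a:b].sum(axis=1)
                  for a, b in zip(edges[:-1], edges[1:])], axis=1)
    # Per-band spectral entropy of the normalized bin distribution
    # (uniform weights stand in for the paper's w_jk).
    H = np.empty_like(A)
    for j, (a, b) in enumerate(zip(edges[:-1], edges[1:])):
        p = X[:, a:b] / np.maximum(A[:, [j]], 1e-12)
        H[:, j] = -(p * np.log(np.maximum(p, 1e-12))).sum(axis=1)
    # Pool short frames into one feature vector per long frame: mean/std of
    # the band energies (linear and dB), mean entropy, and the entropy
    # standard deviation normalized by its mean.
    per = max(1, int(long_s * 1000 / hop_ms))
    rows = []
    for s in range(0, len(A), per):
        a, h = A[s:s + per], H[s:s + per]
        db = 10 * np.log10(np.maximum(a, 1e-12))
        rows.append(np.concatenate([a.mean(0), a.std(0), db.mean(0), db.std(0),
                                    h.mean(0),
                                    h.std(0) / np.maximum(h.mean(0), 1e-12)]))
    return np.array(rows)   # one row of 6 * n_bands features per long frame
```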

The Bayesian Information Criterion (BIC) was used to segment the recordings into episodes; it can compare models of different complexity fit to different amounts of data, and every possible boundary position within a growing window is evaluated until a boundary is found. BIC penalizes model complexity, as measured by the number of model parameters. Given data <math>X = \{x_i : i = 1, \ldots, N\}</math> modeled by <math>M</math> with <math>\#(M)</math> parameters, which produces <math>X</math> with likelihood <math>L(X \mid M)</math> under its best parameterization, <math>\operatorname{BIC}(X,M) = \log(L(X \mid M)) - \frac{\lambda}{2}\#(M)\log(N)</math>, where <math>\lambda</math> determines the weight applied to the parameter-count penalty. For segmentation, a sequence of feature vectors <math>X = \{x_i \in \Re^d : i = 1, \ldots, N\}</math> is modeled as independent draws from either a single multivariate Gaussian (hypothesis <math>H_0</math>) or two Gaussians split at candidate time <math>t</math> (hypothesis <math>H_1</math>). The difference in scores for the two hypotheses is <math>\Delta \operatorname{BIC}(t) = \log\left(\frac{L(X\mid H_1)}{L(X\mid H_0)}\right)-\frac{\lambda}{2}\frac{d^2 + 3d}{2}\log(N)</math>. When <math>\Delta\operatorname{BIC}(t) > 0</math>, a segment boundary is placed at <math>t</math> and a new search begins with a reset window size <math>N</math>.
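A sketch of this growing-window BIC segmentation, under the assumption of maximum-likelihood Gaussian fits (so the log-likelihood ratio reduces to log-determinants of sample covariances); the `min_seg` parameter (minimum frames on each side of a candidate boundary) is an illustrative choice, not from the paper.

```python
import numpy as np

def log_det_cov(X):
    """log|Sigma| for the ML (biased) covariance of the rows of X."""
    C = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(X.shape[1])
    return np.linalg.slogdet(C)[1]

def delta_bic(W, t, lam):
    """Delta-BIC for a boundary at frame t of window W (two Gaussians vs one).
    The extra parameters of the second Gaussian number (d^2 + 3d)/2:
    d for the mean plus d(d+1)/2 for the covariance."""
    N, d = W.shape
    llr = 0.5 * (N * log_det_cov(W)
                 - t * log_det_cov(W[:t]) - (N - t) * log_det_cov(W[t:]))
    return llr - 0.5 * lam * (d * d + 3 * d) / 2 * np.log(N)

def segment(X, lam=1.0, min_seg=10):
    """Grow a window from the last boundary; when the best Delta-BIC inside
    the window is positive, emit a boundary there and restart the search."""
    bounds, start = [], 0
    end = start + 2 * min_seg
    while end <= len(X):
        W = X[start:end]
        cands = range(min_seg, len(W) - min_seg)
        scores = [delta_bic(W, t, lam) for t in cands]
        if scores and max(scores) > 0:
            t = cands[int(np.argmax(scores))]
            bounds.append(start + t)
            start += t
            end = start + 2 * min_seg
        else:
            end += 1
    return bounds
```

Raising `lam` makes the penalty stronger and the segmenter more conservative, which is the trade-off tuned against the false-boundary rate in the evaluation below.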

Certain activities are likely to be repeated from day to day, so the segmented episodes are clustered to group recordings of these activities. Spectral clustering is used: a matrix of affinities between segments is calculated from the symmetrized KL divergence between single diagonal-covariance Gaussian models fit to the segments' feature frames, given by <math>\operatorname{D_{KLS}}(i, j) = \frac{1}{2}\left[(\mu_i-\mu_j)^\prime(\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i-\mu_j) + \operatorname{tr}\left(\Sigma_i^{-1}\Sigma_j + \Sigma_j^{-1}\Sigma_i - 2I\right)\right]</math> where <math>\Sigma_i</math> is the unbiased estimate of the feature covariance within segment <math>i</math> and <math>\mu_i</math> is the vector of per-dimension means for the segment. <math>\operatorname{D_{KLS}}</math> grows as the means and covariances of two segments become more distinct, so the distance can be converted to an affinity by <math>a_{ij} = e^{-\operatorname{D_{KLS}}(i,j)^2/2\sigma^2}</math>, where <math>\sigma</math> is a tunable parameter that controls the number and size of clusters. The eigenvectors of the affinity matrix are calculated and used to choose <math>K</math> clusters; <math>K</math> is chosen as the value for which the GMM used for the final clustering has the best BIC score.
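The clustering step can be sketched as follows, assuming diagonal covariances (which reduce <math>\operatorname{D_{KLS}}</math> to per-dimension sums). This is a simplified stand-in: <math>K</math> is passed in rather than selected by BIC, and a small farthest-point-initialized k-means replaces a library clusterer in the eigenvector embedding.

```python
import numpy as np

def kls(mu_i, var_i, mu_j, var_j):
    """Symmetrized KL divergence between diagonal-covariance Gaussians."""
    dm = mu_i - mu_j
    return 0.5 * (np.sum(dm * dm * (1.0 / var_i + 1.0 / var_j))
                  + np.sum(var_j / var_i + var_i / var_j - 2.0))

def spectral_cluster(segments, K, sigma=1.0, iters=100):
    """Cluster segments (each an array of feature frames) into K groups."""
    mus = [s.mean(axis=0) for s in segments]
    vrs = [s.var(axis=0, ddof=1) + 1e-6 for s in segments]  # unbiased estimate
    n = len(segments)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = kls(mus[i], vrs[i], mus[j], vrs[j])
    A = np.exp(-D ** 2 / (2 * sigma ** 2))          # distance -> affinity
    deg = A.sum(axis=1)
    L = A / np.sqrt(np.outer(deg, deg))             # normalized affinity
    _, V = np.linalg.eigh(L)                        # eigenvalues ascending
    E = V[:, -K:]                                   # top-K eigenvectors
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # row-normalized embedding
    # K-means in the embedded space; farthest-point init for determinism.
    centers = [0]
    while len(centers) < K:
        d2 = ((E[:, None] - E[centers][None]) ** 2).sum(-1).min(axis=1)
        centers.append(int(d2.argmax()))
    C = E[centers].copy()
    for _ in range(iters):
        lab = ((E[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if np.any(lab == k):
                C[k] = E[lab == k].mean(axis=0)
    return lab
```

A larger `sigma` flattens the affinities and merges clusters; a smaller one sharpens them and splits clusters, which is how this parameter controls cluster number and size.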


Ground-truth data for the segmentation and clustering was generated from 62 hours of audio, annotated into 139 segments, each belonging to one of 16 broad classes. Segmentation was evaluated by adjusting the <math>\lambda</math> parameter so that false boundaries were produced no more than about once every 50 minutes, then measuring the proportion of true boundaries detected within three minutes of the correct time. The auditory spectrum performed best in most cases, particularly when converted to dB; using PCA to compress the best features, a sensitivity of 0.874 could be obtained.

Fifteen clusters of the segments were found automatically and evaluated via a one-to-one mapping of each automatic cluster to its most similar annotated class; one ground-truth class had no corresponding automatic cluster and five had no correctly labeled frames, giving a per-frame accuracy of 67.3%. Clustering without prior segmentation gave 42.7% accuracy. A search over window lengths found that shorter windows performed better for clustering; in an exhaustive parameter grid search, three-second time frames performed best, and using different features for clustering raised accuracy to 82.8%. The segmentation algorithm, however, would be negatively affected by such short window lengths.

An interface was built that shows the derived segments alongside a spectral representation of the waveform, which facilitates finding the desired audio. The speech content of the audio is potentially highly useful but rife with regulations and privacy concerns; techniques for securing personal archives should be developed to ensure they do not fall into the wrong hands.
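The boundary-scoring procedure described above can be made concrete as follows; `boundary_eval` is a hypothetical helper, not from the paper, that matches each true boundary to at most one detected boundary within the tolerance and reports sensitivity plus the leftover false alarms.

```python
def boundary_eval(true_b, est_b, tol_s=180.0):
    """Greedy one-to-one matching of estimated to true boundaries (seconds):
    each true boundary claims the nearest unused estimate within tol_s."""
    est_b = sorted(est_b)
    used = [False] * len(est_b)
    hits = 0
    for t in sorted(true_b):
        best, best_d = None, tol_s
        for i, e in enumerate(est_b):
            if not used[i] and abs(e - t) <= best_d:
                best, best_d = i, abs(e - t)
        if best is not None:
            used[best] = True
            hits += 1
    sensitivity = hits / len(true_b) if true_b else 1.0
    return sensitivity, len(est_b) - hits   # (recall, false-alarm count)
```

Sweeping the segmenter's <math>\lambda</math> and rerunning this evaluation traces out the sensitivity versus false-alarm-rate trade-off at which the reported operating point (one false boundary per 50 minutes) was fixed.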

accessing_minimal-impact_personal_audio_archives.txt · Last modified: 2015/12/17 21:59 (external edit)