Evaluation Methods for Musical Audio Beat Tracking Algorithms

Authors Matthew E.P. Davies, Norberto Degara and Mark D. Plumbley
Publication Info Retrieved here
Retrieval date 8/29/11

Compares a variety of techniques used to evaluate beat tracking algorithm performance and proposes a new technique, the beat histogram.

Evaluation requirements

A beat tracker automatically recovers a sequence of times where a person might tap their foot, which is useful for segmentation, interaction, and content-based effects. For a given beat tracking algorithm, we need musical data with ground truth annotations (which can be hard to distribute, time-consuming, and subjective) and a way to compare algorithm output to the ground truth. This paper focuses on the effectiveness of an evaluation technique in terms of its ability to measure information gain.

Evaluation difficulty

Objective evaluation techniques directly compare the beat locations to ground-truth, normally with a tolerance window (~50-70ms or ~20%) due to the precision of hand-annotation. The ground truth also may reflect the tap locations (production) or, more specifically, the actual beat locations (perception). It is also common for different ground-truth annotations of the same song to be off-beat (out of phase) compared to one another, or at a different metric level (double or half tempo).

Subjective evaluation can avoid some of these difficulties, but it is very time-consuming (real-time listening at best) and the meaning of a “correct” annotation can be ambiguous, making it impractical and non-reproducible.

Evaluation techniques

Terminology: <math>\gamma_b</math> is the <math>b^{th}</math> of <math>B</math> beats produced by the beat tracker, <math>a_j</math> is the <math>j^{th}</math> of <math>J</math> ground-truth annotations, and <math>\Delta_b = \gamma_b - \gamma_{b-1}</math> and <math>\Delta_j = a_j - a_{j-1}</math> are the inter-beat and inter-annotation intervals respectively.

F-measure is defined as <math>F = \frac{2c}{2c + f^+ + f^-}</math>, where <math>c</math> is the number of true positives (correct detections within a tolerance window), <math>f^-</math> is the number of false negatives (missed annotations) and <math>f^+</math> is the number of false positives (extra detections). Note that tracking off-beat relative to the ground truth gives <math>F=0</math>, and tapping above or below the annotated metrical level is penalized.
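As a sketch, the F-measure above can be computed with a greedy matching of beats to annotations inside a tolerance window (the function name and 70 ms default are illustrative, not from the paper):

```python
import numpy as np

def f_measure(beats, annotations, tol=0.07):
    """Sketch of beat-tracking F-measure with a +/- tol tolerance window.

    Each annotation is matched to at most one beat; unmatched beats count
    as false positives, unmatched annotations as false negatives."""
    beats = np.asarray(beats, dtype=float)
    used = np.zeros(len(beats), dtype=bool)
    c = 0  # true positives
    for a in annotations:
        hits = np.where(~used & (np.abs(beats - a) <= tol))[0]
        if hits.size:
            used[hits[0]] = True  # consume the matched beat
            c += 1
    fp = int((~used).sum())       # f+: extra detections
    fn = len(annotations) - c     # f-: missed annotations
    return 2 * c / (2 * c + fp + fn) if c else 0.0
```

With this scheme, off-beat tracking (every beat midway between annotations) matches nothing inside the window and scores 0, as the text notes.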

Cemgil used a Gaussian error function which scores each annotation based on its distance from the closest beat: <math>W(\gamma_b-a_j) = e^{-(\gamma_b-a_j)^2/2\sigma^2}</math>, where the standard deviation is <math>\sigma = 40</math> ms. It is similar to a tolerance window, except that it provides a continuous accuracy score. The overall score is <math>\textrm{Cem}_{\textrm{acc}} = \frac{\sum_j \textrm{max}_b[W(\gamma_b-a_j)]}{(B+J)/2}</math>.
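A minimal implementation of Cemgil's score, following the formula above (function name is illustrative):

```python
import numpy as np

def cemgil_accuracy(beats, annotations, sigma=0.04):
    """Sketch of Cemgil's Gaussian-error accuracy (sigma = 40 ms).

    Each annotation is scored by the Gaussian-weighted distance to its
    closest beat; the sum is normalized by the mean of B and J."""
    beats = np.asarray(beats, dtype=float)
    ann = np.asarray(annotations, dtype=float)
    scores = [np.max(np.exp(-(beats - a) ** 2 / (2 * sigma ** 2))) for a in ann]
    return float(np.sum(scores)) / ((len(beats) + len(ann)) / 2)
```

Unlike a hard tolerance window, a beat 20 ms from its annotation still earns partial credit rather than a binary hit/miss.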

McKinney defines the P-score as the cross-correlation between an impulse train with impulses at annotated beat locations and another train with impulses at detected beat locations, quantized to a 10 ms resolution. All events occurring in the first five seconds are removed. The cross-correlation is taken over a small window <math>w = 0.2\,\textrm{med}(\Delta_j)</math> and the score is defined as <math>\textrm{PScore} = \frac{\sum_w T_a *_{(w)} T_\gamma}{\textrm{max}(J, B)}</math>.
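A sketch of the P-score under the assumptions above (10 ms grid, lags up to 20% of the median inter-annotation interval; removal of the first five seconds is assumed to have happened already, and the function name is hypothetical):

```python
import numpy as np

def p_score(beats, annotations, fs=100):
    """Sketch of McKinney's P-score: cross-correlate impulse trains
    sampled at fs = 100 Hz (10 ms resolution) over lags within
    w = 0.2 * median inter-annotation interval."""
    ann = np.asarray(annotations, dtype=float)
    beats = np.asarray(beats, dtype=float)
    n = int(np.ceil(max(ann.max(), beats.max()) * fs)) + 1
    t_a = np.zeros(n)
    t_g = np.zeros(n)
    t_a[np.round(ann * fs).astype(int)] = 1   # annotation impulse train
    t_g[np.round(beats * fs).astype(int)] = 1  # beat impulse train
    w = int(np.round(0.2 * np.median(np.diff(ann)) * fs))
    # sum of cross-correlation values for lags in [-w, w]
    total = sum(np.sum(t_a * np.roll(t_g, lag)) for lag in range(-w, w + 1))
    return total / max(len(ann), len(beats))
```

Normalizing by <math>\textrm{max}(J, B)</math> means both over-detection and under-detection pull the score below 1.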

Goto and Muraoka used multiple statistics of a beat error sequence <math>\zeta_j = \begin{cases} \frac{|\gamma_b-a_j|}{\Delta^*_j} & \textrm{when } a_j - \Delta_{j-1}/2 \le \gamma_b < a_j + \Delta_j/2 \\ 1 & \textrm{otherwise} \end{cases}</math> where <math>\Delta^*_j = \begin{cases} \frac{\Delta_j}{2} & \textrm{when } \gamma_b \ge a_j \\ \frac{\Delta_{j-1}}{2} & \textrm{when } \gamma_b < a_j \end{cases}</math>, such that the error is normalized to half the inter-annotation interval. A sub-sequence <math>\zeta_k</math> is extracted under the condition that <math>\zeta_j < 0.35</math>, and the beat tracking is “correct” if <math>k_s/k_e < 3/4</math>, where <math>k_s</math> and <math>k_e</math> are the first and last elements of <math>\zeta_k</math>, and if <math>\textrm{mean}(\zeta_k) < 0.2</math> and <math>\textrm{std}(\zeta_k) < 0.2</math>.

Hainsworth and Klapuri developed continuity-based accuracy measures, which find sequences of beats that are correct, i.e. where <math>a_j - \theta\Delta_j < \gamma_b < a_j + \theta\Delta_j</math>, <math>a_{j-1} - \theta\Delta_{j-1} < \gamma_{b-1} < a_{j-1} + \theta\Delta_{j-1}</math> and <math>(1 - \theta)\Delta_j < \Delta_b < (1 + \theta)\Delta_j</math>, with <math>\theta = 0.175</math>. The <math>M</math> continuous segments that result are referred to as <math>\Upsilon_m</math>, and we can calculate the accuracy scores <math>\textrm{CML}_c = \frac{\textrm{max}(\Upsilon)}{J}</math> and <math>\textrm{CML}_t = \frac{\sum_{m=1}^M \Upsilon_m}{J}</math>. The annotations can be re-sampled to half or double the metrical level to get <math>\textrm{AML}_c</math> and <math>\textrm{AML}_t</math>.
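The continuity conditions can be sketched as follows, under the simplifying assumption that beats and annotations are already index-aligned (real implementations match each beat to its nearest annotation first; the function name is hypothetical):

```python
import numpy as np

def continuity(beats, annotations, theta=0.175):
    """Sketch of continuity-based CML_c / CML_t scores.

    A beat is 'continuously correct' if it and its predecessor fall
    within theta of their annotations and its inter-beat interval is
    within theta of the inter-annotation interval."""
    g = np.asarray(beats, dtype=float)
    a = np.asarray(annotations, dtype=float)
    da = np.diff(a)  # inter-annotation intervals Delta_j = a_j - a_{j-1}
    ok = np.zeros(min(len(a), len(g)), dtype=bool)
    for j in range(1, len(ok)):
        dj = da[j - 1]  # simplification: same interval for both windows
        ok[j] = (abs(g[j] - a[j]) < theta * dj
                 and abs(g[j - 1] - a[j - 1]) < theta * dj
                 and (1 - theta) * dj < g[j] - g[j - 1] < (1 + theta) * dj)
    # collect lengths of the M continuous segments Upsilon_m
    runs, count = [], 0
    for flag in ok:
        if flag:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    J = len(a)
    cml_c = (max(runs) if runs else 0) / J  # longest continuous segment
    cml_t = sum(runs) / J                   # total correct beats
    return cml_c, cml_t
```

AML variants would simply re-run this after resampling the annotations to double or half the metrical level.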

One drawback of all of these approaches (except Cemgil's) is that they use tolerance windows to make a binary decision about whether each beat is correct. Furthermore, without resampling, none of them addresses the ambiguity of metrical level, generally assigning off-beat or double/half-tempo tracking an accuracy score of 0. An evaluation method should ideally penalize off-beat tracking much less than completely sporadic/random tracking or tracking whose tempo drifts.

Information gain evaluation

The proposed method finds the normalized error of each beat such that a correct beat has an error of 0 and an out-of-phase beat has an error of about 0.5, so that the timing error of each beat is given by <math>\zeta(j, q) = \begin{cases} \frac{\gamma_q-a_j}{\Delta^*_{j-1}} & \textrm{when } \gamma_q \le a_j \\ \frac{\gamma_q - a_j}{\Delta^*_j} & \textrm{when } \gamma_q > a_j \end{cases}</math>, where <math>\Delta^*_{j-1} = \frac{a_j - a_{j-1}}{2}</math> and <math>\gamma_q = \gamma_b \; : \; a_j - \Delta^*_{j-1} \le \gamma_b < a_j + \Delta^*_j</math> (the set of beats within half an annotation interval of the current annotation). This sequence is called <math>\zeta_{\gamma|a}</math>, the error of the beats given the annotations; we can also calculate <math>\zeta_{a|\gamma}</math>, the error of the annotations given the beats. We can then calculate a histogram of beat error; empirically, 41 bins was found to be a good number. The histogram should be normalized such that <math>p_{\zeta}(z_k)</math> is the probability that a given beat error falls into bin <math>k</math>.
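One plausible implementation of the normalized timing error, under the assumption that the error is scaled so a perfect beat scores 0 and a beat midway between annotations (off-beat) scores ±0.5 (the function name and boundary handling are illustrative):

```python
import numpy as np

def beat_errors(beats, annotations):
    """Sketch: normalized error of each beat, mapped to [-0.5, 0.5].

    Each beat is scored against its nearest annotation, normalized by
    the inter-annotation interval on the side where the beat fell."""
    a = np.asarray(annotations, dtype=float)
    ivals = np.diff(a)  # inter-annotation intervals
    errs = []
    for g in np.asarray(beats, dtype=float):
        j = int(np.argmin(np.abs(a - g)))  # nearest annotation
        if g < a[j]:
            d = ivals[j - 1] if j > 0 else ivals[0]
        else:
            d = ivals[j] if j < len(ivals) else ivals[-1]
        errs.append((g - a[j]) / d)
    return np.asarray(errs)
```

Swapping the roles of the two sequences gives the complementary error of the annotations given the beats.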

Because a tracker at a higher metrical level would produce a multimodal histogram (peaks at -0.5 and 0.5), an accuracy metric can measure the “peakiness” of the histogram. The information gain can be used to measure the distance between the error histogram and the worst-case uniform probability distribution. Using the Kullback-Leibler divergence, we have <math>D = \sum_{k=1}^{K} p_{\zeta}(z_k) \log_2(p_{\zeta}(z_k)) + \log_2(K)</math>, where <math>D</math> is bounded by 0 and <math>\log_2(K)</math>. An “off-beat” tracker will score as high as an on-beat tracker because it also has one large peak, and a tracker at a higher metrical level will have a high information gain with two or more peaks.
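The information gain computation then reduces to a few lines: histogram the errors into <math>K = 41</math> bins and take the KL divergence from the uniform distribution (a sketch; the function name is illustrative):

```python
import numpy as np

def information_gain(errors, K=41):
    """Sketch: KL divergence between the beat-error histogram over
    [-0.5, 0.5] and the worst-case uniform distribution over K bins.
    Returns a value in [0, log2(K)] bits."""
    hist, _ = np.histogram(errors, bins=K, range=(-0.5, 0.5))
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log2(0) = 0 by convention
    return float(np.sum(p * np.log2(p)) + np.log2(K))
```

A perfect tracker concentrates all errors in one bin and attains the maximum <math>\log_2(41) \approx 5.36</math> bits; a random tracker spreads errors uniformly and attains 0.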


To compare evaluation methods, five beat tracking algorithms were run on The Beatles' catalog, one being a deterministic tracker which treated each song as having a fixed tempo of 120 bpm. The resulting accuracy scores confirm that some evaluation techniques over-penalize off-beat tracking or tracking at a different metrical level, and under-penalize deterministic (or random) tracking. By systematically offsetting the tracked beat locations, it is also shown that accuracy metrics differ in their sensitivity to a constant offset (depending on their window size); in particular, the information gain-based technique is not sensitive at all.

evaluation_methods_for_musical_audio_beat_tracking_algorithms.txt · Last modified: 2015/12/17 21:59 (external edit)