This document describes the difficulties inherent in evaluating a computational beat tracker, and suggests a technique for obtaining some accuracy values which may be used as a baseline.
When writing a beat detection algorithm, it's natural to use some kind of evaluation technique to determine the algorithm's effectiveness. Typically this evaluation involves algorithmically comparing the generated beat locations to a list of human-annotated locations. A comprehensive list of such evaluation algorithms was compiled by Davies et al. and is summarized here. This is also the set of evaluations implemented in the Beat Evaluation Toolbox and used in the MIREX audio beat tracking task. All of these algorithms generate a number (typically a continuous value from 0 to 1) which is meant to indicate whether a beat tracking algorithm's output is relatively close to or far from the hand-annotated data.
One clear difficulty with this approach is that the algorithmically generated beat locations are being compared (objectively) to subjectively generated beat locations. In other words, the specific locations of the beats depend a great deal on the person annotating them - there is no “absolute truth” for this task. One annotator may, for example, tap beats at half or twice the tempo of another annotator, or may tap only the “off-beats”. This makes the accuracy scores produced by an evaluation technique ambiguous to interpret.
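To make this ambiguity concrete, consider a simple F-measure-style score applied to annotators who are each internally consistent but disagree about the metrical level. (The ±70 ms tolerance window and the greedy matching below are illustrative assumptions, not any specific toolbox's implementation.)

```python
def f_measure(estimated, annotated, tolerance=0.07):
    """Score two beat-time lists (in seconds) between 0 and 1."""
    unmatched = list(annotated)
    hits = 0
    for beat in estimated:
        # Greedily match each estimated beat to the closest unmatched annotation
        near = [a for a in unmatched if abs(a - beat) <= tolerance]
        if near:
            unmatched.remove(min(near, key=lambda a: abs(a - beat)))
            hits += 1
    if not estimated or not annotated or hits == 0:
        return 0.0
    precision, recall = hits / len(estimated), hits / len(annotated)
    return 2 * precision * recall / (precision + recall)

full_tempo = [i * 0.5 for i in range(16)]         # taps at 120 BPM
half_tempo = [i * 1.0 for i in range(8)]          # same song, taps at 60 BPM
off_beats  = [0.25 + i * 0.5 for i in range(16)]  # taps between the beats

print(f_measure(full_tempo, full_tempo))  # -> 1.0 (perfect agreement)
print(f_measure(half_tempo, full_tempo))  # -> ~0.67, despite both being plausible
print(f_measure(off_beats, full_tempo))   # -> 0.0
```

All three annotators could reasonably claim to be tapping “the beat”, yet only identical annotations score 1 - which is exactly why raw scores are hard to interpret.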
This suggests that a better way to determine the effectiveness of a beat tracker is not to test whether it achieves 100% accuracy across the board, but to test how well it does relative to the general ambiguity of human annotation. Specifically, it is worth knowing how well humans agree when hand-annotating beat locations. A simple way to test this would be to choose a song and ask a group of people to (separately) annotate the beat locations. If their locations match very closely, and a hypothetical beat tracker also produces these beat locations, we can begin to deduce that the beat tracker is effective; on the other hand, if the beat tracker produces wildly different locations, we can assume that the beat tracker is failing.
An important aspect of this hypothetical situation is whether or not the locations match closely. If a group of people annotate the beat locations for a given song, and their locations do not generally match up well, it is likely that the rhythmic structure of the song is ambiguous, and perhaps it is unreasonable to expect a computer to reliably reproduce the locations. This leads to the more specific question - how well does a human annotator score (using some evaluation technique), and what score indicates that a beat tracker is being effective?
A simple way to investigate this question would be to compare the beat locations annotated by one human against those (independently) annotated by another. To do so effectively, it would be useful to have a group of (musically dissimilar) songs and a group of people who have (separately) annotated beat locations for each song. Fortunately, the MIREX training dataset satisfies these criteria; it consists of 30 songs, each accompanied by 40 sets of beat locations, each annotated by a separate person. To obtain some baseline scores which may be compared against a beat tracker's scores, we can evaluate each person's annotation of each song against every other person's annotation and average the resulting accuracy across all people and all songs. In pseudocode, this may look like:
songScores = []
# For all songs with annotations
for song in songs:
    scores = []
    # For all human-generated annotations of this song
    for annotations in song:
        # Compare this annotation to everyone else's annotations
        for compareAnnotations in song:
            # Don't want to compare the annotation to itself...
            if compareAnnotations == annotations:
                continue
            # Get the accuracy scores
            scores.append(getAccuracy(annotations, compareAnnotations))
    # Get this song's mean score, and store it
    songScores.append(mean(scores))
# The baseline score - mean across all songs
baselineScores = mean(songScores)
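The pseudocode above can be fleshed out into a runnable sketch. The `get_accuracy` function here is a hypothetical stand-in for the Toolbox's metrics (a simple tolerance-based hit rate), and the data structures are assumed, not those of the MIREX dataset:

```python
from itertools import permutations
from statistics import mean

def get_accuracy(estimated, annotated, tolerance=0.07):
    """Fraction of annotated beats with an estimated beat within `tolerance`
    seconds. A hypothetical stand-in for the Toolbox's evaluation metrics."""
    if not annotated:
        return 0.0
    hits = sum(1 for a in annotated
               if any(abs(a - e) <= tolerance for e in estimated))
    return hits / len(annotated)

def baseline_score(songs):
    """songs: a list of songs, each a list of per-annotator beat-time lists.
    Returns the mean pairwise inter-annotator accuracy across all songs."""
    song_scores = []
    for song in songs:
        # permutations(song, 2) yields every ordered pair of distinct
        # annotators, which matches the double loop with a skip above
        scores = [get_accuracy(a, b) for a, b in permutations(song, 2)]
        song_scores.append(mean(scores))
    return mean(song_scores)

# Toy example: three annotators of one song, two in close agreement
song = [[0.0, 0.5, 1.0, 1.5],
        [0.01, 0.51, 0.99, 1.5],
        [0.25, 0.75, 1.25, 1.75]]  # an "off-beat" annotator
print(baseline_score([song]))  # -> 0.333...
```

Only the two agreeing annotators score 1 against each other (2 of the 6 ordered pairs), so the toy baseline comes out at 1/3.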
The evaluation techniques used are similar to those in the Beat Evaluation Toolbox, but differ as follows:
As a result, each evaluation technique produces a score between 0 and 1. All vary continuously in this range, with the exception of the Goto score, which is either 1 or 0 depending on whether specific criteria are met. When run on the entire MIREX training dataset, this experiment results in the following mean scores:
These scores are unexpectedly low. If we interpret the scores as “accuracy”, we would hope that all humans generally agree on beat locations, so that each person's annotations would be highly “accurate” when compared against anyone else's. However, the average score obtained by comparing one person's annotations to another's is uniformly and significantly lower than 1. This suggests that humans disagree a great deal on beat locations, and furthermore, that we cannot necessarily hope for a beat tracker to achieve scores near 1 across the board. It is also important to note that while these techniques do effectively measure how well two sets of annotated beat locations match up, they do not measure any kind of universal “accuracy” or truth (which is to be expected, given the ambiguity of beat locations!). Finally, the scores listed above could potentially be used as a rough guideline: if a computerized beat tracker is reliably achieving higher scores, we can assume that it is performing well (like a very agreeable human).
In 2010, the top-ranking beat tracking algorithm was based on bidirectional long short-term memory (LSTM) neural networks. On the MIREX dataset (not the MIREX training dataset used above, but a similar “secret” dataset used for testing), it achieved the following scores:
Note that this algorithm (close to state of the art as of writing) scores uniformly close to, but lower than, the baseline scores listed above, with the exception of the Goto and AMLc scores. This suggests that there is still some room for improvement in beat tracking algorithms, although they are quite close to replicating the behavior of an average human.