Date: 10/28/14 - 10/31/14
Summaries of each presentation given at ISMIR 2014.
There are a variety of techniques used to play the guitar (e.g. hammer-ons). Professional guitarists can recognize these techniques by ear, without seeing how the guitar is actually being played. Automatically classifying these techniques would be useful, but the signal is subtle and different tonalities confound comparisons. The authors compiled a large dataset covering a variety of techniques (normal, muted, vibrato, pull-off, hammer-on, etc.) with various levels of distortion, reverb, delay, and chorus. They extracted various features, including phase derivatives, log/rooted cepstrum, etc. Features were computed around the onset of each note (found via basic spectral flux) and sparse-coded into frame-level codewords whose mean was computed for each recording. They trained a linear SVM with cross-validation and tested it on a single real-world example. The features extracted using sparse coding performed better than the raw features. The best F-score achieved was about 60%. Using the cepstral and phase-based features reduced confusion.
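The onset-finding step described above (basic spectral flux plus peak picking) can be sketched roughly as follows; the threshold and the toy spectrogram are illustrative guesses, not values from the paper.

```python
# Sketch of note-onset detection via spectral flux with simple peak
# picking.  The threshold is an arbitrary illustrative value.
import numpy as np

def spectral_flux(spectrogram):
    """Half-wave-rectified frame-to-frame spectral difference.

    spectrogram: (n_frames, n_bins) magnitude array."""
    diff = np.diff(spectrogram, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)

def pick_onsets(flux, threshold=0.5):
    """Return frame indices that are local flux maxima above a threshold."""
    onsets = []
    for i in range(1, len(flux) - 1):
        if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]:
            onsets.append(i + 1)  # +1: flux[i] compares frames i+1 and i
    return onsets

# Toy spectrogram: energy jumps at frames 3 and 7.
S = np.zeros((10, 4))
S[3:], S[7:] = 1.0, 2.0
print(pick_onsets(spectral_flux(S)))  # → [3, 7]
```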
In MIR, big challenges for large-scale content-based shared tasks include the lack of large datasets, formal definitions of the ground truth/problem, reliability of human-annotated ground truth, share-ability of data, reproducibility, etc. This work proposes the Kiki-Bouba challenge, which involves developing systems that can discriminate, identify, recognize, and imitate two Aristotelian categories of music (Kiki and Bouba). The sample space is algorithmically generated music (so the data is limitless and free of copyright) with perfect ground truth (all music falls into one of the two classes). The first task, discrimination, seeks to determine that there are two classes (or to approximate them) given no labels. The classification task assigns a piece to one of the two classes. Recognition seeks to find high-level concepts which describe each class. Imitation seeks to actually compose music in each style. Kiki music is more accelerated and transient; Bouba music is slower with more chords. SuperCollider code is available to generate both types of music. With such a simple problem, it can be determined whether a system is actually exploiting musical factors or not. Once the simple problem is solved, more and more complex problems can be presented.
Automatic music transcription includes a variety of tasks, including instrument transcription, polyphonic F0 estimation, drum transcription, melody estimation, and even beat tracking. A more constrained definition is the transformation of a musical audio signal into a score (e.g. a MIDI file) which allows resynthesis of the music without error in timing, pitch, or instrumentation. This covers pitch, onset, duration, and instrument information. A good transcription system would allow symbolic queries/analysis to be made over music audio files. It could also aid music signal processing tasks such as source separation. It is one of the most difficult problems in music analysis because there are many concurrent sources, potentially with masking/overlap (similar to the cocktail party problem). Typically, simplifications are made to the problem to allow practical progress, by limiting polyphony or avoiding overlap: piano or drum transcription, excluding drums, or only instrumental music. In most cases, both expert-based and machine learning-based systems have been built which work well. PLCA- and NMF-based approaches have been used, with increasing complexity over the years, as have neural networks, with increasingly promising results. At IRCAM, the audio2note project created a system which produces multiple-f0 estimates with onsets, with no parameters and approximately real-time processing. The system performs peak tracking and noise removal to get rid of spurious sinusoidal peaks. The system doesn't work perfectly, but the output can be corrected and used in compositions.
Many transcription systems use spectrogram factorization (PLCA, NMF). The intention is to obtain templates for each note, and activations for the templates which are interpreted as indicating when each note is active. However, the templates may not match the input sources, which hurts performance. This work attempts multiple-instrument polyphonic music transcription by first doing a “conservative” polyphonic transcription, then adapting the dictionary to the input and re-transcribing. The “conservative” step is a PLCA transcription on a constant-Q spectrogram, using an EM algorithm. To adapt the templates, the spectrogram frames which correspond to each pitch are extracted. Using these frames, new templates are created, and the update is propagated to nearby templates. Finally, the recording is re-transcribed using the new templates, with some output processing to cleanly binarize the output. This system can also be expanded to multi-instrument settings. The system was evaluated on single instruments and bassoon-violin duets, with improved results when template adaptation was used.
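The basic factorization idea above can be sketched in a few lines. Note this is a Euclidean multiplicative-update NMF with fixed templates, standing in for the paper's PLCA/EM formulation; the templates, toy spectrogram, and binarization threshold are all invented for illustration.

```python
# Minimal sketch of template-based spectrogram factorization: given
# fixed per-pitch templates W, estimate activations H in V ≈ W @ H
# with multiplicative updates, then binarize into a piano roll.
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-9):
    """Euclidean multiplicative updates for H, with W held fixed."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

# Two toy "pitch" templates over 4 frequency bins.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
# A "recording": pitch 0 active in frames 0-1, pitch 1 in frames 2-3.
V = np.array([W[:, 0], W[:, 0], W[:, 1], W[:, 1]]).T
H = nmf_activations(V, W)
piano_roll = H > 0.5  # crude binarization of the activations
print(piano_roll.astype(int))
```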
Once a binary piano-roll representation has been obtained by an automatic transcription system, the note start and end times need to be estimated in order to obtain a score. However, spurious notes can result from consistent octave/harmonic errors. This work generates a number of transcription candidates and evaluates each candidate in terms of how well it fits the audio as a whole. The candidates are sampled in each chunk according to their likelihood, with different densities across candidates, such that the polyphony constraint is satisfied. 100 candidates are generated. The note likelihood is computed as the geometric mean of the single-pitch likelihoods (estimated by the transcription system) of all pitches which contribute to that note. The multi-pitch likelihood is defined as the match between spectral peaks and the harmonics of all pitches.
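The note-likelihood computation above amounts to a geometric mean; a tiny sketch, with made-up likelihood values:

```python
# Geometric mean of per-pitch likelihoods contributing to one note,
# computed in log space for numerical stability.  The input values are
# invented for illustration.
import math

def note_likelihood(pitch_likelihoods):
    logs = [math.log(p) for p in pitch_likelihoods]
    return math.exp(sum(logs) / len(logs))

print(note_likelihood([0.9, 0.8, 0.9]))  # cube root of 0.9*0.8*0.9 ≈ 0.865
```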
Most drum transcription systems either segment the audio recording into beats or sub-beats and then classify each subsegment (as e.g. hi-hat + snare, bass drum), or they attempt to separately estimate the activations of each drum and transcribe based on those. However, a small number of drum patterns covers a large proportion of drum durations, suggesting that drum transcription could instead be done at the bar level. In this setting, the resulting drum pattern is always valid and it's a simple way to inject musical domain knowledge, but beat and bar times are required beforehand and you can only classify patterns in your database. MFCCs are extracted over overlapping bars, synchronized at the sub-beat level. The resulting feature vectors are classified using an SVM, and the resulting class can be used to retrieve the corresponding drum pattern. The fifty most common drum patterns were used, in 4/4 with kick, snare, hi-hat, open hi-hat, ride, and crash. These drum patterns were synthesized into 600 “songs” with artificial humanization, using fluidsynth. Training and testing on synthesized data yielded 80% accuracy, though this varied according to the drum kit used for synthesis. The use of overlapping bars improved accuracy by a small amount, probably because cymbals bleed into the following bar. Even at 80% accuracy, failures are “graceful” because the only allowable patterns are musically valid, resulting in a high F-measure on drum events. When testing on real data with artificial training data on a small test set, the resulting accuracy was much lower.
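The bar-level pipeline can be sketched as: pool frame features onto a fixed sub-beat grid for each bar, then classify the bar against a pattern database. Here a nearest-neighbour lookup stands in for the paper's SVM, and the features, pattern count, and dimensions are toy values.

```python
# Sketch of bar-level drum pattern retrieval: sub-beat-synchronous
# feature pooling plus a nearest-neighbour classifier (stand-in for
# the SVM used in the paper).
import numpy as np

def bar_feature(frames, n_subbeats=16):
    """Average frame-level features within each sub-beat of one bar.

    frames: (n_frames, n_dims) feature array for the bar."""
    chunks = np.array_split(frames, n_subbeats)
    return np.concatenate([c.mean(axis=0) for c in chunks])

def classify_bar(feature, database):
    """Return the index of the closest stored pattern feature."""
    dists = [np.linalg.norm(feature - ref) for ref in database]
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
frames = rng.random((64, 2))              # toy frame features for one bar
database = [rng.random(32) for _ in range(50)]  # 50 common patterns
database[7] = bar_feature(frames)         # pretend pattern 7 matches this bar
print(classify_bar(bar_feature(frames), database))  # → 7
```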
Given a query musical phrase, this work seeks to find the phrase in a piece of music, without access to the symbolic score and without requiring that the query be the main melody. This is in contrast to query-by-humming systems (which require the main melody) and fingerprinting (where the query must be an excerpt of the musical piece itself). The phrase can be buried in polyphonic music signals. It would be ideal to transcribe the notes of the piece and the query, but transcription doesn't work well enough. Some form of source separation is needed to extract the relevant partials, but explicit source separation is not. Assuming that the query is present in the piece, using Bayesian NMF the query phrase's spectra can be added to the basis, and the resulting activation can be correlated against the activation in the NMF of the query phrase alone. The maxima of the correlation correspond to where the phrase appears. Nonparametric NMF allows for an arbitrary number of bases. To avoid the problem that the query phrase's bases are not used in the whole piece's NMF decomposition, a different prior is used for the snippet dictionary. The system was tested with exact query matches, the same query on a different instrument, and the same instrument played faster. The proposed method performs better than conventional features.
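The matching step above boils down to cross-correlating activation sequences; a sketch with the NMF itself omitted and toy activations standing in for the real decompositions:

```python
# Sketch of the activation-correlation step: with the query's basis
# spectra fixed inside the piece's decomposition, cross-correlate the
# piece's activation with the query's own activation; correlation
# maxima indicate where the phrase occurs.
import numpy as np

def locate_phrase(piece_activation, query_activation):
    """Cross-correlate summed activations; return best-matching offset."""
    p = piece_activation.sum(axis=0)  # sum over bases -> (n_frames,)
    q = query_activation.sum(axis=0)
    corr = np.correlate(p, q, mode="valid")
    return int(np.argmax(corr))

query = np.array([[0, 1, 2, 1, 0]], dtype=float)
piece = np.zeros((1, 20))
piece[0, 12:17] = query[0]            # phrase embedded at frame 12
print(locate_phrase(piece, query))    # → 12
```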
Audio alignment seeks to match up common times across different recordings of the same piece of music. This is often done either by finding a path through a similarity matrix between the two pieces, or by creating a generative model which treats the audio signal as a sequence of states, where the states must proceed in left-to-right order. The latter approach depends on how well the states are represented by their associated (chroma) feature vectors. It also assumes that the states are independent, while in music they are in practice extremely dependent due to repetition. So, instead of requiring a left-to-right structure, each state is assumed to come from an underlying ergodic (cyclical) HMM. Also, each state has a unique representative duration probability density function to keep states from being occupied for too short or too long a time. The proposed system is analogous to inferring a latent score from the music, from which the left-to-right HMM is constructed. The resulting model works better than a pure left-to-right model, but about as well as DTW over a similarity matrix.
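The DTW baseline mentioned above can be sketched directly: accumulate a cost matrix over pairwise feature distances and backtrack the minimum-cost path. One-dimensional toy sequences stand in for chroma vectors here.

```python
# Minimal dynamic time warping: cost-matrix accumulation plus
# backtracking, on scalar "features" for simplicity.
import numpy as np

def dtw_path(a, b):
    """Align sequences a and b; return a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# b is a time-stretched version of a; the path maps them together.
a = [0.0, 1.0, 2.0, 3.0]
b = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
print(dtw_path(a, b))
```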
Song structure analysis usually means finding the boundaries between thematic sections and/or labeling the repeated sections. However, the representation and visualization of the structure is also important. A good representation of structure is the Infinite Jukebox, which creates a graph of the beats in a song and randomly either moves to the next beat or jumps to another similar beat. The (sparse) non-beat-to-beat jumps in the graph encode similarity structure. In this work, a graph is constructed over the beats, the graph is partitioned to recover structure, and the partition size can be varied to expose multi-level structure. The graph is constructed by creating MFCC-weighted connections between successive beats; beats are then linked to their nearest neighbors (in CQT space), also weighted by feature similarity. The final graph is a linear combination of the local graph and the repetition graph, where the weighting is optimized so that a local move is about as likely as a repetition move. Once the graph is constructed, it is partitioned using spectral clustering: the eigenvectors of the normalized Laplacian encode component membership for each beat and are used to identify the individual repeated components. The components can be used to create low-rank reconstructions of the graph, where the rank corresponds to the granularity of the segmentation. Estimating the number of segments works slightly worse than a state-of-the-art baseline; using an oracle number of segments works slightly better.
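The spectral-clustering step can be sketched on a toy beat graph: build the normalized Laplacian and split beats by the sign of its second eigenvector. Using the sign of one eigenvector is the simplest two-cluster assignment; the paper's multi-level clustering is more involved.

```python
# Sketch of partitioning a beat-similarity graph via the normalized
# Laplacian.  The 6-beat graph below is a toy example with two tightly
# connected sections joined by one weak repetition link.
import numpy as np

def spectral_partition(A):
    """Split graph nodes into two clusters via the Fiedler vector."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)        # ascending eigenvalues
    return (vecs[:, 1] > 0).astype(int)   # sign of 2nd eigenvector

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
labels = spectral_partition(A)
print(labels)  # beats 0-2 in one cluster, beats 3-5 in the other
```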
Pattern discovery attempts to find all patterns and their occurrences within a given piece, where a pattern is a (harmonic) musical idea/motive/theme/section (potentially overlapping, not necessarily contiguous) which repeats (not necessarily exactly) at least once. The proposed approach uses music segmentation techniques (on audio signals) to obtain the most-repeated parts of the song. First, a quarter-beat-synchronous chromagram is computed. Second, a key-invariant similarity matrix is computed (by testing all key rotations and choosing the most similar). Third, all possible repeated segments (diagonal paths in the similarity matrix) are found by assigning a local “path strength” score. Finally, the patterns are clustered by finding all patterns whose start and end times are within a time window. This approach could also be adapted to symbolic or monophonic pattern discovery tasks. The algorithm was tested on a small testbed and was found to achieve state-of-the-art results for most metrics.
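The key-invariant comparison in the second step can be sketched for a single pair of chroma vectors: try all 12 rotations and keep the best match. The chroma vectors below are toy one-hot triads, not real features.

```python
# Key-invariant chroma similarity: maximum cosine similarity over all
# 12 circular rotations of one of the vectors.
import numpy as np

def key_invariant_similarity(c1, c2):
    best = -1.0
    for shift in range(12):
        r = np.roll(c2, shift)
        sim = np.dot(c1, r) / (np.linalg.norm(c1) * np.linalg.norm(r))
        best = max(best, float(sim))
    return best

c_major = np.zeros(12); c_major[[0, 4, 7]] = 1   # C E G
d_major = np.zeros(12); d_major[[2, 6, 9]] = 1   # D F# A
print(key_invariant_similarity(c_major, d_major))  # → 1.0 (same shape, transposed)
```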
Structure estimation algorithms often first segment the piece, then label the segments. Segmentation is often done by computing novelty, e.g. the amount by which the immediate future differs from the immediate past at a given point in time. The novelty computation is usually designed by hand. The proposed work instead learns a novelty estimator using a convolutional neural network with a single output (the novelty at a point in time). It was trained on the SALAMI dataset. Because there are relatively few boundary frames, the positive examples were duplicated and the frames neighboring a positive example were treated as downweighted positive examples. The resulting system was state of the art, with a particularly large improvement when detecting boundaries at very fine accuracy.
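The label trick for rare boundary frames can be sketched as a target-construction function; the neighbor weight here is an illustrative guess, and the duplication of positive examples would happen separately at training time.

```python
# Sketch of boundary-target construction: boundary frames get a full
# positive label, their neighbors a downweighted one.  The 0.25 weight
# is an assumed value, not the paper's.
def boundary_targets(n_frames, boundaries, neighbor_weight=0.25):
    targets = [0.0] * n_frames
    for b in boundaries:
        targets[b] = 1.0
        for nb in (b - 1, b + 1):
            if 0 <= nb < n_frames and targets[nb] == 0.0:
                targets[nb] = neighbor_weight
    return targets

print(boundary_targets(8, [3]))  # → [0.0, 0.0, 0.25, 1.0, 0.25, 0.0, 0.0, 0.0]
```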
Meter inference attempts to determine the type of meter (the layers of the metric hierarchy) and align its layers to a music audio signal, a task which different listeners can perform differently (especially depending on cultural background). This work attempts to infer meter, beat, tempo, and downbeat using a unified model across different types of music via the “bar-pointer model”: at each point in time the position in the bar is incremented until it reaches a maximum value (the meter length), and the amount of incrementation (the tempo) can change within the bar. These parameters are treated as hidden variables and are inferred according to a transition model. The model was tested on a large corpus of metrically varied pieces, both in the setting where the rhythm was known and where it wasn't, and was able to infer the meter well.
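The deterministic part of the bar-pointer transition is simple to write down; in the actual model, position and tempo are hidden variables with stochastic transitions, so this sketch with fixed toy values only shows the position/wrap mechanics.

```python
# Toy bar-pointer position update: the position advances by the tempo
# each frame and wraps at the bar length (a new downbeat).  Tempo and
# bar length are illustrative values.
def advance(position, tempo, bar_length):
    position += tempo
    if position >= bar_length:
        position -= bar_length  # wrap: a new bar begins
    return position

pos = 0.0
trajectory = []
for _ in range(6):
    pos = advance(pos, tempo=0.3, bar_length=1.0)
    trajectory.append(round(pos, 2))
print(trajectory)  # → [0.3, 0.6, 0.9, 0.2, 0.5, 0.8]
```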
A sound mixture often contains both harmonic and percussive sounds; the goal here is to separate these parts. However, there is often a third, noisy component, and with most techniques the noise ends up partly in both. This work adds a “residual” component to cover noise-like content. It extends the common approach of computing the horizontal and vertical medians of the spectrogram and using them as masks by adding a “separation factor” (beta): a time-frequency bin only goes to the harmonic or percussive component if its estimate there is at least beta times bigger than in the other. This leaves some residual bins which are treated as noise. Because beta affects the harmonic and percussive components in the same way, an iterative process which first separates out the harmonic component and then the percussive component can be used to gain separate control over the harmonic and percussive thresholds.
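The masking rule can be sketched with 3-point medians on a toy spectrogram; the beta value and filter length are arbitrary, the median filter wraps at the edges (fine for a sketch), and real systems median-filter over longer windows.

```python
# Sketch of harmonic/percussive/residual masking: horizontal and
# vertical medians give harmonic and percussive estimates, and a bin
# is assigned only where one estimate exceeds beta times the other.
import numpy as np

def median3(x, axis):
    """3-point median filter along an axis (edges wrap via np.roll)."""
    a = np.roll(x, 1, axis=axis)
    b = np.roll(x, -1, axis=axis)
    return np.median(np.stack([a, x, b]), axis=0)

def hpr_masks(S, beta=2.0):
    H = median3(S, axis=1)   # horizontal (time) median: harmonic
    P = median3(S, axis=0)   # vertical (frequency) median: percussive
    harmonic = H > beta * P
    percussive = P > beta * H
    residual = ~(harmonic | percussive)
    return harmonic, percussive, residual

# Toy spectrogram (bins x frames): a sustained tone in bin 1 plus a
# broadband click at frame 3.
S = np.zeros((4, 8))
S[1, :] = 1.0     # horizontal line -> harmonic
S[:, 3] += 1.0    # vertical line -> percussive
h, p, r = hpr_masks(S)
print(h[1, 5], p[0, 3], r[1, 3])  # tone bin, click bin, ambiguous crossing
```

Note how the bin where the tone and the click cross lands in the residual: neither estimate dominates by the factor beta.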
This work uses fusion strategies to combine different onset detectors, find good parameter settings, and improve performance on different musical styles. Fusion can be done by exploiting characteristics of the individual onset detection systems, by computing a linear combination of onset detection functions, or by merging/deleting onsets based on the outputs of the different algorithms. Peak picking is performed by smoothing and thresholding, fitting a polynomial, and backtracking to choose an onset location before the energy maximum. All parameters were tuned. The best-performing onset detectors were fused ones, though fusion did not always help.
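The simplest fusion strategy mentioned above, a linear combination of detection functions, can be sketched as follows; the two input functions, normalization scheme, and weights are all toy choices.

```python
# Sketch of onset-detection-function fusion: max-normalize each
# detector's function, then take a weighted sum before peak picking.
import numpy as np

def fuse(odfs, weights):
    """Weighted sum of per-detector onset functions, each max-normalized."""
    fused = np.zeros_like(odfs[0], dtype=float)
    for odf, w in zip(odfs, weights):
        fused += w * (odf / max(odf.max(), 1e-9))
    return fused

odf_a = np.array([0, 0, 5.0, 0, 1.0, 0, 0])   # detector A: strong at frame 2
odf_b = np.array([0, 0, 2.0, 0, 2.0, 0, 0])   # detector B: frames 2 and 4 equal
fused = fuse([odf_a, odf_b], [0.5, 0.5])
print(fused.round(2))  # the peak both detectors agree on (frame 2) dominates
```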
Beat trackers are evaluated to determine how well they work. They can be evaluated by a listening test (confirming whether the output sounds correct) or by comparing against a ground truth. Compared to objective evaluation, listening is potentially more valid, but it may not be consistent or reproducible, and it is expensive. A variety of approaches have been proposed for computing accuracy, which vary in their treatment of temporal accuracy, beat octave/phase, and continuity. We don't know which one is best, and “best” may depend on the application. To determine how well the evaluation methods reflect true “correctness”, beat trackers were run on audio (with off-beat, double-tempo, and half-tempo versions also created), their accuracy was computed with different metrics, and their subjective validity was rated by 22 participants. The resulting objective accuracies were then compared to the subjective accuracies. For most evaluation methods, the correlation increases if alternative phases/octaves are allowed. All the metrics have parameters, however, so the subjective/objective correlations were re-run with swept parameter settings. For most methods, the correlation increases when using stricter evaluation settings (which produce lower scores).
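One common family of objective beat metrics matches estimated beats to reference beats within a tolerance window and reports an F-measure; a sketch, where the 70 ms window is a typical choice rather than any specific metric's parameter:

```python
# Sketch of tolerance-window beat evaluation: greedily match each
# estimated beat to an unused reference beat within +/- tolerance
# seconds, then compute precision/recall/F-measure.
def beat_f_measure(estimated, reference, tolerance=0.07):
    matched, used = 0, set()
    for e in estimated:
        for i, r in enumerate(reference):
            if i not in used and abs(e - r) <= tolerance:
                matched += 1
                used.add(i)
                break
    if not estimated or not reference:
        return 0.0
    precision = matched / len(estimated)
    recall = matched / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [0.5, 1.0, 1.5, 2.0]
est = [0.52, 1.01, 1.6, 2.0]       # third beat is 100 ms off
print(beat_f_measure(est, ref))    # → 0.75
```

Tightening `tolerance` is exactly the "stricter evaluation" knob discussed above: scores drop, but (per this study) the correlation with listener judgments tends to rise.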