Presenters: Dan Ellis, Matt McVicar, …
Summaries of each presentation given at NEMISIG 2013.
Over the past 10 years, LabROSA has published extensively in the music realm, but last year there was only one publication. This is in part because it's hard to get funding for music research. However, there are a good handful of people in the lab working on music.
Zhuo Chen has been developing a parametric source-note model, where a template is developed for each note of each instrument. Each template encodes the time evolution of the note (as each part of the note is different), making each template parametric, with parameters for how quickly you move between states and for the amplitude envelope. The codebooks and parameters need to be learned, currently using EM.
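As an illustration of the state-based envelope idea (a toy sketch, not Chen's actual model - the fixed state durations and first-order update rule here are assumptions), each state can carry a target amplitude and a rate controlling how quickly the envelope approaches it:

```python
import numpy as np

def note_envelope(levels, rates, frames_per_state=30):
    """Toy parametric note envelope (hypothetical sketch).

    Each state has a target amplitude `level` and a `rate` controlling
    how quickly the envelope moves toward that target, so the parameters
    encode both the amplitude shape and the speed of state transitions.
    """
    env, amp = [], 0.0
    for level, rate in zip(levels, rates):
        for _ in range(frames_per_state):
            amp += rate * (level - amp)  # first-order approach to target
            env.append(amp)
    return np.array(env)
```

For example, levels `[1.0, 0.0]` with rates `[0.5, 0.1]` give a fast attack followed by a slow decay.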
Thierry Bertin-Mahieux worked on a large-scale cover song recognition system (over the Million Song Dataset) in which the beat-synchronous chromagram is calculated and the magnitude of its 2D Fourier transform (2DFTM) is taken. This representation is more robust to transposition and time offsets. The representation for the entire song is calculated from principal components of the 2DFTM. This has given the best accuracy for large-scale datasets. The Million Song Dataset itself has not changed much. The Million Song Dataset Challenge engaged a new group of participants (no one from ISMIR) in trying to predict listenership; collaborative filtering performed best.
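The shift-invariance that makes the 2DFTM robust can be sketched as follows (a minimal example; the 12-bin, 75-beat patch size is an assumption about the setup, not a confirmed detail of the system):

```python
import numpy as np

def two_dftm_patch(chroma, patch_len=75):
    """2D Fourier transform magnitude of a beat-synchronous chroma patch.

    chroma: (12, n_beats) array. The magnitude of the 2D FFT is invariant
    to circular shifts in both pitch (transposition) and time (offset).
    """
    patch = chroma[:, :patch_len]
    return np.abs(np.fft.fft2(patch))

# Transposing (rolling pitch) and shifting time leaves the magnitude unchanged:
rng = np.random.default_rng(0)
c = rng.random((12, 75))
a = two_dftm_patch(c)
b = two_dftm_patch(np.roll(np.roll(c, 3, axis=0), 10, axis=1))
assert np.allclose(a, b)
```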
Brian McFee has been working on a jazz database project in collaboration with the Center for Jazz Studies at Columbia. Most of the tools used in MIR are genre-specific, so part of this project is developing jazz-capable algorithms (jazz rhythm is looser, and many instruments do not play on the beat). Part of the project is figuring out what the jazz group needs. One example is trying to determine the rhythmic patterns of a piece of music as described by a jazz music analyst. Cross-correlations of the onset envelopes in small frequency bands are taken against the full-band envelope to visualize the synchronization in each frequency band.
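The per-band synchronization measure can be sketched like this (a hypothetical implementation - the normalization and lag range are assumptions, as the talk did not specify them):

```python
import numpy as np

def band_synchrony(band_envelopes, full_envelope, max_lag=10):
    """Cross-correlate each band's onset envelope with the full-band envelope.

    band_envelopes: (n_bands, n_frames); full_envelope: (n_frames,).
    Returns an (n_bands, 2*max_lag+1) array of normalized cross-correlations,
    one row per frequency band, for lags -max_lag..max_lag.
    """
    full = (full_envelope - full_envelope.mean()) / (full_envelope.std() + 1e-9)
    out = []
    for env in band_envelopes:
        e = (env - env.mean()) / (env.std() + 1e-9)
        xc = np.correlate(e, full, mode="full") / len(full)
        mid = len(full) - 1  # index of zero lag in "full" mode output
        out.append(xc[mid - max_lag: mid + max_lag + 1])
    return np.array(out)
```

A band that plays on the beat shows a correlation peak near zero lag; off-beat bands peak at nonzero lags.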
Future research directions need to move on from music recommendation. Separating music into events, determining the fine-scale structure, and figuring out which music is “good” and “bad” (globally and per-user) would all be useful work.
Lyrics are a fundamentally important part of music, but they are not studied in depth in the MIR community - instead, general harmony/timbre features are extracted, even though lyrical content is hugely important to the perception of music. Time-aligned phoneme locations for a song could help with various tasks, including recommendation and cover song detection (where lyrics tend to stay the same). A lot of work has been done on automatic speech recognition; it's pretty sophisticated and can deal with homophones well. However, given just the vocals (separated cleanly from the original track), state-of-the-art systems work very poorly - around 75% word error rate. This is largely because speech and singing are quite different. Speech tends to be cleaner, with a “spoken” timing - in singing, there may be reverberation and compression, longer vowels, and more “flatness” over frequency (for sustained sung tones). Trying to flatten the a cappella's frequency and time-scale it to match the spoken word did not help much.
Fortunately, a lot of the work has been done already because many popular songs have lyric transcriptions available on the internet. However, there's no time-alignment information, which would be very useful for many MIR tasks. To align, the proposed system first matches an a cappella to an original track using fingerprinting, which retrieves metadata; it then retrieves lyrics from the web, synthesizes them as speech, and uses DTW to align the synthesized speech to the audio to get time-stamped lyrics. Hopefully, a database of 1000 tracks will be generated with a cappellas and time-stamped phonemes/words. This could then be used to train speech recognition systems specifically for singing.
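The alignment step can be sketched with textbook DTW over a frame-level cost matrix between the synthesized speech and the a cappella (a minimal implementation; the actual features and distance used by the system are not specified in the talk):

```python
import numpy as np

def dtw_path(cost):
    """Dynamic time warping over a pairwise cost matrix.

    cost[i, j] is the distance between frame i of the synthesized lyrics
    and frame j of the a cappella. Returns the accumulated cost matrix and
    the optimal alignment path as a list of (i, j) pairs.
    """
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = cost[i, j] + best
    # Backtrack from the end to recover the warping path.
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        steps = [(a, b) for a, b in steps if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i, j))
    return acc, path[::-1]
```

The path maps each synthesized-speech frame (with its known word/phoneme label) to a time in the recording, yielding the time stamps.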
Dartmouth has graduate programs in digital music and computer science in the context of music.
In one experiment, subjects created algorithmically generated music (with five parameters controlling the generation), which was then played back to other subjects; the listeners could determine the intended emotion, even outside of Western communities. The same emotions could also be detected when the parameters were mapped to a visual representation.
Another research direction involves decoding brain images taken when someone is listening to music - trying to understand the relationship between audio features and the way the brain responds to music. Can we do MIR with brain images - classifying brain images elicited by music? Are the features used “validated” by neural features? Can we decode the brain images back to sound? fMRI images were taken, especially of the auditory cortex. The data from the fMRI were fed into ML classifiers and above-chance results were obtained. Regression between auditory features and the brain images was attempted such that the brain image is multiplied by some matrix to reconstruct the audio features.
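The regression step - multiplying the brain image by a learned matrix to reconstruct audio features - can be sketched as ridge regression (a minimal sketch; the actual regularization and feature sets used in the study are not specified):

```python
import numpy as np

def fit_decoder(brain, audio, alpha=1.0):
    """Ridge regression from brain images to audio features.

    brain: (n_trials, n_voxels), flattened fMRI images;
    audio: (n_trials, n_features), the corresponding audio features.
    Returns W such that brain @ W approximates the audio features.
    """
    n_vox = brain.shape[1]
    return np.linalg.solve(brain.T @ brain + alpha * np.eye(n_vox),
                           brain.T @ audio)
```

Decoding a new image is then just `image @ W`; the `alpha` penalty keeps the fit stable when voxels far outnumber trials.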
A related question is whether we can reconstruct the music itself from the brain images, and whether you can compose music just from having your brain imaged.
Given data which is high-dimensional, noisy, and time-sequenced, can we come up with a representation which encodes the important information, has a manageable dimension, and is multi-scale? The work uses the Beatles dataset, which consists of MFCCs for each Beatles song. Pairwise squared Euclidean distances are calculated. To create a multiscale representation, single-scale representations are calculated and concatenated in a sane way. Diagonals of near-zero values in the self-similarity matrix denote repetitions, so repetitions can be determined within each song. These are summarized as “mini-patterns”: short lists of numbers which encode the repetition structure. This representation can be used to construct a network of similarity between all of the pieces in order to find cover songs.
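The pairwise squared Euclidean distance computation can be sketched as follows (a minimal sketch, assuming the frames are rows of an MFCC matrix):

```python
import numpy as np

def self_similarity(frames):
    """Pairwise squared Euclidean distances between feature frames.

    frames: (n_frames, n_dims), e.g. MFCC vectors for one song.
    Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b for an O(n^2 d) computation.
    Repeated sections show up as stripes of near-zero values parallel
    to the main diagonal.
    """
    sq = np.sum(frames ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * frames @ frames.T
    return np.maximum(d, 0.0)  # clip tiny negative values from round-off
```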
Founded in 2008, MARL focuses on Immersive Audio, Music Cognition, Computer Music, and Music Informatics. As a research focus - sometimes human data is not enough for music, and you need to do content-based analysis.
In chord recognition, typically features (chroma) are extracted and cleaned up, then some kind of pattern recognition is done to find the chord sequence. Research at MARL examines each of these steps in detail, determining which features, filtering, etc. work best. Moving-average, median, and recurrence-based filtering techniques are used for smoothing the chroma features, ideally without blurring boundaries. Pattern matching techniques include templates, Gaussians, GMMs, and HMMs. When there is good pre-filtering, the benefit of one approach over another is not huge. Maximum likelihood does not always give the best optimization criterion; it would be better to optimize frame error instead.
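Median smoothing of chroma - which suppresses transient noise while preserving the step edges at chord changes better than a moving average does - can be sketched in pure NumPy (the window width here is an assumed parameter, not one reported at the talk):

```python
import numpy as np

def smooth_chroma(chroma, width=9):
    """Median-filter each chroma bin along time.

    chroma: (12, n_frames). Edge frames are padded by repetition so the
    output has the same shape as the input.
    """
    pad = width // 2
    padded = np.pad(chroma, ((0, 0), (pad, pad)), mode="edge")
    # (12, n_frames, width) view of all length-`width` windows along time
    windows = np.lib.stride_tricks.sliding_window_view(padded, width, axis=1)
    return np.median(windows, axis=-1)
```

A single-frame spike is removed entirely, whereas a moving average would smear it across the window.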
Chord recognition is only as good as the features/audio allow it to be. In particular, structure and order are really important in features - bag-of-frames tends to be insufficient because you can have the same mean and variance of feature vectors with completely nonsensical structures. Deep architectures are well-suited to encoding structure and patterns. Convolutional neural networks take advantage of local patterns in the data, and their output is easily interpretable. This may be extended well with unsupervised learning techniques.
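The local-pattern idea behind a convolutional layer can be illustrated with a bare 2D sliding-window product (an illustration of the operation only, not the networks used in this research):

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid-mode 2D convolution (really cross-correlation, as in CNNs).

    Slides the kernel k over the input x (e.g. a chroma or spectrogram
    patch) and takes the dot product with each local patch, so the same
    small pattern detector is applied at every position.
    """
    n, m = x.shape
    a, b = k.shape
    out = np.empty((n - a + 1, m - b + 1))
    for i in range(n - a + 1):
        for j in range(m - b + 1):
            out[i, j] = np.sum(x[i:i + a, j:j + b] * k)
    return out
```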
Research has also been done on interactive music systems which train finite state transducers on MIDI data.
If someone listens to music, can we measure physiological changes and determine characteristics of the music? Detrended fluctuation analysis (DFA, a time-series analysis technique for non-stationary series which measures how chaotic a signal is at various scales) has been used to look at local complexity in music, and has been found to estimate danceability: at small time scales classical music is very disorderly, whereas repetitive music is orderly even at small time scales. Using 25 tracks from many genres, the DFA score was computed and compared to a “grooviness” index, with a pretty good correlation. Applying DFA to physiological signals seemed to capture similar information. Instead of looking at the mean DFA, the DFA score at short time scales correlated very well with the rating of grooviness. Whether the listener enjoyed the piece affected the relation of the grooviness score to the stimulus data.
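DFA itself can be sketched as follows (a standard textbook formulation with linear detrending; the scales and detrending order used in the study are assumptions):

```python
import numpy as np

def dfa(signal, scales):
    """Detrended fluctuation analysis of a 1-D signal.

    For each window size in `scales`: integrate the mean-removed signal,
    split it into non-overlapping windows, remove a linear trend from each
    window, and take the RMS residual. How the fluctuation F(s) grows with
    scale s characterizes how orderly the signal is at each time scale.
    """
    y = np.cumsum(signal - np.mean(signal))
    fluct = []
    for s in scales:
        n_win = len(y) // s
        resid = []
        for w in range(n_win):
            seg = y[w * s:(w + 1) * s]
            t = np.arange(s)
            coef = np.polyfit(t, seg, 1)  # linear trend within the window
            resid.append(np.mean((seg - np.polyval(coef, t)) ** 2))
        fluct.append(np.sqrt(np.mean(resid)))
    return np.array(fluct)
```

The slope of log F(s) versus log s is the DFA exponent; the study above looked at the short-scale behaviour in particular.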
We'd like to automatically identify the sections of a song by finding the boundaries of the sections and clustering them by their similarity. Chord progression and timbre are good indications of the section of the music. Labeling a piece of music is a subjective task, based in part on “feel”. To replicate this algorithmically, harmonic beat-synchronous features were extracted via The Echo Nest, with a median filter used to smooth the chromagram while keeping sharper edges. A self-similarity matrix was calculated with a correlation distance. Taking a power-log expansion of the matrix gave better contrast. The self-similarity matrix is then approximated as the product of two non-negative matrices via NMF. Convex NMF was also used, where one of the matrices in the product is expanded to the product of the input matrix and a non-negative matrix whose columns sum to one. Using convex NMF makes the output clearer by making one of the matrices have the sections on rows instead of diagonals. The convex NMF solution also converges to its typical value more quickly. Given the CNMF solution, k-means with two clusters is used on the decomposition matrices to find boundaries. Then, the diagonal of the decomposed matrix is analyzed to find similar sections. Performance was very good for CNMF.
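The plain NMF step - approximating the self-similarity matrix as a product of two non-negative matrices - can be sketched with Lee-Seung multiplicative updates (a generic sketch of standard NMF, not the convex NMF variant described above):

```python
import numpy as np

def nmf(S, rank, n_iter=500, seed=0):
    """Factor a non-negative matrix S ≈ W @ H (squared-error objective).

    For structure analysis, S is the (n_frames x n_frames) self-similarity
    matrix and `rank` is the expected number of distinct sections; each
    row of H then tends to activate on one section's frames.
    """
    rng = np.random.default_rng(seed)
    n, m = S.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates keep W and H non-negative.
        H *= (W.T @ S) / (W.T @ W @ H + 1e-9)
        W *= (S @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```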
The MET-lab focuses on Machine listening and MIR, new music interfaces, robots, and music production technologies. One project matches real-time program notes to classical music, with a visualization for a “map” to the themes of the piece. Another is the magnetic resonator piano, which is a drop-in system for any acoustic piano which allows for electromagnetic resonation of the strings of the piano, allowing for infinite sustain. They also work on time-varying music emotion recognition, which maps emotions on scales of valence and arousal and makes a probabilistic map which represents the distribution of the emotions and how they evolve over time. These maps vary over the course of the piece. The lab purchased four robots which have a common platform with robots at other schools so that technology can be shared. Drexel has collaborated with the Philadelphia Science Festival to create visualizations for a live jazz music performance. They have also worked on educational games.
A musical performance is the interpretation by a performer of a score (using some controlling gestures applied to an instrument) to create sound. If the score is mapped directly to sound, we are doing sound synthesis; the other direction is score transcription. Both paradigms ignore the “gesture” step. There has been work on mapping the score to gestures, and mapping gestures to sound (which is related to physical modeling). Determining the gestures from sound is called indirect acquisition, as opposed to direct acquisition from sensors. Using sensors is accurate but is intrusive and complex. Indirect acquisition can be done offline and is simple and cheap to do, but is hard to make robust and accurate algorithms for. For indirect acquisition, there are physically, perceptually informed approaches, and also a more “hands-off” data-mining based approach.
The discussed work focuses on violin, looking at the gestures of string played, finger position, force, velocity, bow-bridge distance, and bow tilt. Given some acquired data, an indirect acquisition system can be trained. Various sensors (pickups, Polhemus motion sensors) were used to record the desired gestures. Many common acoustic features were extracted from the corresponding audio. The pitch was used to determine finger position; spectral features were used to determine the rest of the gestures for each string using MLP neural networks. The predicted string would swap very quickly, so some hysteresis was used to smooth the output, which then matched the actual string pretty well. The overall accuracy was good.
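The hysteresis smoothing of the string prediction can be sketched like this (a hypothetical rule with an assumed hold length; the actual smoothing used was not specified):

```python
def smooth_string_track(predictions, hold=5):
    """Hysteresis smoothing of a frame-wise string classifier output.

    Only switch to a new string after it has been predicted for `hold`
    consecutive frames; otherwise keep the current string. This removes
    rapid spurious swaps between strings.
    """
    current = predictions[0]
    run, candidate = 0, None
    out = []
    for p in predictions:
        if p == current:
            run, candidate = 0, None      # back on the current string
        elif p == candidate:
            run += 1                       # candidate string persists
            if run >= hold:
                current, run, candidate = p, 0, None
        else:
            candidate, run = p, 1          # new candidate string appears
        out.append(current)
    return out
```

A one-frame blip to another string is ignored, while a sustained change is accepted after `hold` frames.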