MIDI Dataset

We have a large dataset of MIDI files, which may or may not have good audio data. We have good features for a million songs as part of the Million Song Dataset (MSD). We'd like to match the MIDI files to their corresponding entries in the MSD and extract useful ground truth information from them. The difficulty is scale: there are many MIDI files, each with many possible matches in the MSD, so everything needs to be done efficiently.


We have about 150,000 MIDI files downloaded from the internet. Can we also find any large collections of MusicXML or other symbolic formats?

MIDI Metadata

The MIDI files are sometimes, but not always, named after the song they transcribe. MIDI files can also carry annotations that specify an artist, song name, etc. We need good ways to extract all of the metadata we can find in each MIDI file so that we get a head start on possible matches. The Echo Nest has some methods for extracting this information; see also Google Refine and Freebase. For each MIDI file, we may be able to get metadata from:

  • The filename
  • Tags embedded in the MIDI file (are there any, and how do we extract them?)
  • The MIDI file's length (is it too short to help?)
  • The MIDI file tempo
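
As a rough illustration of the tag and tempo items above: Standard MIDI Files store free text (track names, copyright notices, generic text) and tempo changes as meta events inside each track chunk. The sketch below walks a file's bytes with only the standard library and collects them. The function names and the returned dictionary layout are made up for this example, the Latin-1 decoding is an assumption (the files may use other encodings), and a MIDI library would normally do this work. It does not compute the file's length, and truncated or malformed files will raise an exception rather than be repaired.

```python
import struct

# Standard MIDI File meta-event types that carry free text.
TEXT_META_TYPES = {
    0x01: "text",
    0x02: "copyright",
    0x03: "track_name",
    0x04: "instrument_name",
    0x05: "lyric",
    0x06: "marker",
}
SET_TEMPO = 0x51  # payload: microseconds per quarter note, 3 bytes big-endian


def _read_vlq(data, pos):
    """Read a MIDI variable-length quantity; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def _scan_track(track, metadata):
    """Walk one MTrk chunk, collecting text and tempo meta events."""
    pos, running_status = 0, None
    while pos < len(track):
        _, pos = _read_vlq(track, pos)  # delta time, unused here
        status = track[pos]
        if status == 0xFF:  # meta event: type, length, payload
            meta_type = track[pos + 1]
            length, pos = _read_vlq(track, pos + 2)
            payload = track[pos:pos + length]
            pos += length
            if meta_type in TEXT_META_TYPES:
                # Assumption: Latin-1 never fails to decode, but may be wrong.
                text = payload.decode("latin-1").strip()
                metadata["text"].append((TEXT_META_TYPES[meta_type], text))
            elif meta_type == SET_TEMPO and length == 3:
                usec_per_quarter = int.from_bytes(payload, "big")
                if usec_per_quarter:
                    metadata["tempos_bpm"].append(60_000_000 / usec_per_quarter)
        elif status in (0xF0, 0xF7):  # sysex: length, payload
            length, pos = _read_vlq(track, pos + 1)
            pos += length
        else:  # channel message, possibly using running status
            if status & 0x80:
                running_status = status
                pos += 1
            if running_status is None:
                raise ValueError("data byte with no preceding status byte")
            # Program change and channel pressure take one data byte.
            pos += 1 if 0xC0 <= running_status <= 0xDF else 2


def read_midi_metadata(midi_bytes):
    """Return text annotations and tempos found in a Standard MIDI File."""
    if midi_bytes[:4] != b"MThd":
        raise ValueError("not a Standard MIDI File")
    header_len = struct.unpack(">I", midi_bytes[4:8])[0]
    pos = 8 + header_len
    metadata = {"text": [], "tempos_bpm": []}
    while pos + 8 <= len(midi_bytes):
        chunk_id = midi_bytes[pos:pos + 4]
        chunk_len = struct.unpack(">I", midi_bytes[pos + 4:pos + 8])[0]
        if chunk_id == b"MTrk":
            _scan_track(midi_bytes[pos + 8:pos + 8 + chunk_len], metadata)
        pos += 8 + chunk_len
    return metadata
```

Calling `read_midi_metadata(open(path, "rb").read())` on each file would give a list of (kind, text) pairs to mine for artist and title strings, plus any tempos the file declares.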

MIDI Matching

Once we have a list of candidates for some song in the MSD, we can try to determine whether each is actually a match using the Echo Nest features. This idea is very similar to cover song detection, which is often done by computing beat-synchronous chromagrams. The idea is that two performances of the same song may differ in instrumentation or style, but they should share the same underlying melodic/harmonic structure. The same is true for our MIDI files. So, given a MIDI file, we need to be able to generate beat-synchronous chromagrams. This will involve

  • Beat tracking the MIDI file (there may be pre-existing implementations of this)
  • Summarizing the notes in each beat as a chromagram
  • Matching, in some really efficient way, the MIDI-generated chromagram against the chromagram from the Echo Nest.
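
The last two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the notes are assumed to be already available as (start, end, pitch) tuples in seconds, the beat times are assumed to come from some beat tracker, and the audio-side chromagram is assumed to have been brought to the same beat grid. The function names are hypothetical. The similarity measure (mean per-beat cosine similarity, maximised over the 12 circular pitch shifts to tolerate transposition) is one simple choice, not necessarily the one to use at scale.

```python
import math


def beat_sync_chroma(notes, beat_times):
    """Summarize notes as one 12-bin chroma vector per beat interval.

    notes: iterable of (start_sec, end_sec, midi_pitch).
    beat_times: sorted beat boundaries in seconds.
    Each bin holds the total sounding time of that pitch class in the beat.
    """
    n_beats = len(beat_times) - 1
    chroma = [[0.0] * 12 for _ in range(n_beats)]
    for start, end, pitch in notes:
        # Simple O(notes * beats) loop; fine for a sketch, slow at scale.
        for i in range(n_beats):
            overlap = min(end, beat_times[i + 1]) - max(start, beat_times[i])
            if overlap > 0:
                chroma[i][pitch % 12] += overlap
    return chroma


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_transposition_similarity(chroma_a, chroma_b):
    """Return (score, shift) maximising mean per-beat cosine similarity.

    A shift of s compares bin k of chroma_a with bin (k + s) mod 12 of
    chroma_b. The sequences are assumed to be beat-aligned already; only
    the first min(len) beats are compared.
    """
    n = min(len(chroma_a), len(chroma_b))
    best = (-1.0, 0)
    for shift in range(12):
        total = sum(
            _cosine(chroma_a[i], chroma_b[i][shift:] + chroma_b[i][:shift])
            for i in range(n)
        )
        score = total / n if n else 0.0
        if score > best[0]:
            best = (score, shift)
    return best
```

Comparing beat for beat like this ignores differences in structure or starting offset between the MIDI file and the recording, so it only serves as a starting point for the matching step.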

Ground Truth Extraction

If we have a MIDI file, know which song it is, and have the audio (or features), how can we get useful information from the MIDI? The first step is to align the MIDI file with the audio. Code for this exists; it may or may not be quick enough and good enough. Once the MIDI file is time-aligned to the audio, we can extract useful ground truth information. Packages for this exist, but there may be more information we can get, or other formats to get it in:

  • Beats
  • Bar locations
  • Tempo
  • Time signature
  • Chords
  • Melody
  • Transcription
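
To make the "align, then extract" idea concrete, here is a minimal sketch of the second half. It assumes the alignment step has already produced a list of anchor points, i.e. pairs of (MIDI time, audio time) that correspond to each other; how those anchors are obtained is not covered. Any event time taken from the MIDI file (beats, bar lines, chord changes, note onsets) can then be mapped onto the audio timeline by linear interpolation between anchors. The function names and the anchor representation are hypothetical.

```python
from bisect import bisect_right


def warp_times(midi_times, anchors_midi, anchors_audio):
    """Map MIDI times to audio times along an alignment path.

    anchors_midi and anchors_audio are equal-length lists of corresponding
    times; anchors_midi must be strictly increasing. Times outside the
    anchored range are extrapolated from the nearest segment.
    """
    if len(anchors_midi) < 2 or len(anchors_midi) != len(anchors_audio):
        raise ValueError("need at least two matching anchor points")
    warped = []
    for t in midi_times:
        # Index of the segment [j - 1, j] that contains (or is nearest to) t.
        j = bisect_right(anchors_midi, t)
        j = min(max(j, 1), len(anchors_midi) - 1)
        m0, m1 = anchors_midi[j - 1], anchors_midi[j]
        a0, a1 = anchors_audio[j - 1], anchors_audio[j]
        frac = (t - m0) / (m1 - m0)
        warped.append(a0 + frac * (a1 - a0))
    return warped


def local_tempo_bpm(beat_times):
    """Tempo implied by each pair of consecutive beat times, in BPM."""
    return [60.0 / (b - a) for a, b in zip(beat_times, beat_times[1:])]
```

For example, warping the MIDI beat times and passing the result to `local_tempo_bpm` gives a beat-level tempo curve on the audio timeline, which covers the first and third items in the list above.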


  • First couple of days: Read papers; get used to Python MIDI, IPython Notebook, MIDI files (using the small dataset), and reading HDF5 files from the MSD
  • Before the end of June: How can we resolve filename misspellings, abbreviations, missing artist names, etc. automatically, efficiently, and over as much MIDI data as possible?
  • Early July: Beat-synchronous chromagrams of MIDI
  • Late July: Chromagram matching, efficiently, accurately
  • August: Improvements, evaluation, ground truth extraction.
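
For the filename question in the June item, one cheap baseline is to normalize each filename and score it against "artist - title" strings with approximate string matching. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function names, the 0.6 threshold, and the candidate strings are illustrative assumptions. Comparing every filename against every MSD entry this way would be far too slow at full scale, so some form of indexing or candidate pruning would still be needed.

```python
import difflib
import os
import re


def normalize(text):
    """Lowercase and reduce every run of non-alphanumerics to one space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def filename_to_query(path):
    """Turn a MIDI file path into a normalized search string."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return normalize(stem)


def rank_candidates(query, candidates, min_ratio=0.6):
    """Return (ratio, candidate) pairs at or above min_ratio, best first."""
    scored = []
    for candidate in candidates:
        matcher = difflib.SequenceMatcher(None, query, normalize(candidate))
        ratio = matcher.ratio()
        if ratio >= min_ratio:
            scored.append((ratio, candidate))
    return sorted(scored, reverse=True)
```

A misspelled name such as `Beatles_-_Yesterdy.MID` still ranks "The Beatles - Yesterday" first among a handful of candidates, although heavy abbreviations and missing artist names would need more than character-level similarity.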
midi_dataset.txt · Last modified: 2015/12/17 21:59 (external edit)