The Lakh MIDI Dataset v0.1

The Lakh MIDI Dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).

Table of Contents

  1. Get the dataset
  2. License/attribution
  3. Learn
  4. Frequently asked questions

Get the dataset

LMD-full - The full collection of 176,581 de-duplicated MIDI files. Each file is named according to its MD5 checksum. Note that no attempt was made to remove invalid MIDI files from this collection; as a result, it contains a few thousand files which are likely corrupt. [mirror 1]
LMD-matched - A subset of 45,129 files from LMD-full which have been matched to entries in the Million Song Dataset. [mirror 1]
LMD-aligned - All of the files in LMD-matched, aligned to the 7digital preview MP3s from the Million Song Dataset. If you need the preview clips themselves, contact me. [mirror 1]
LMD-full filenames - A JSON file which maps each MD5 checksum to the original filenames for all entries in LMD-full. [mirror 1] [mirror 2]
Match scores - A JSON file which lists the match confidence score for every match in LMD-matched and LMD-aligned. [mirror 1] [mirror 2]
LMD-matched metadata - HDF5 files from the Million Song Dataset for every entry in LMD-matched. See here for a tutorial on using this data. [mirror 1]
Clean MIDI subset - A subset of MIDI files with filenames which indicate their artist and title (with some inaccuracy), as used in a few of my papers. [mirror 1]
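
As a rough illustration of how the match scores file might be used, the sketch below picks the highest-scoring MIDI file for one Million Song Dataset track. The JSON layout it assumes ({track ID: {MIDI MD5: score}}) is an assumption and should be checked against the downloaded file. The track ID, checksums, and scores in the example are made up, and best_match is a hypothetical helper name.

```python
import json


def best_match(match_scores, msd_id):
    """Return (md5, score) for the highest-scoring MIDI file of one track.

    Assumes match_scores has the form {msd_id: {midi_md5: score}};
    this layout is an assumption, not confirmed by the page.
    """
    candidates = match_scores[msd_id]
    md5 = max(candidates, key=candidates.get)
    return md5, candidates[md5]


# Made-up excerpt standing in for the parsed match scores JSON file.
example_scores = json.loads("""
{
  "TRAAAAA0000000000A": {
    "0123456789abcdef0123456789abcdef": 0.71,
    "fedcba9876543210fedcba9876543210": 0.64
  }
}
""")

md5, score = best_match(example_scores, "TRAAAAA0000000000A")
print(md5, score)
```

With the real file, json.load on the opened file would replace the inline string.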

License/attribution

The Lakh MIDI Dataset is distributed with a CC-BY 4.0 license; if you use this data in any capacity, please reference this page and my thesis:

Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.

Of course, I did not transcribe any of the MIDI files in the Lakh MIDI Dataset. While MIDI files have a built-in mechanism for attribution (the Copyright meta-event), it is not used consistently, so attributing each of the MIDI files in the dataset to a particular author is not feasible. If you'd like to try, here is a list of the text of all of the Copyright meta-events in the Lakh MIDI Dataset.

If you use the Million Song Dataset, please reference this paper:

Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. "The Million Song Dataset". In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 591–596, 2011.

Learn

To facilitate use of this dataset, here are a few IPython notebook tutorials:

Frequently asked questions

What kind of information can I get from MIDI files? In a simplistic view, a MIDI file can be considered a score with additional optional annotations. As a result, you can count on getting a transcription of the song, as well as meter information such as beats and downbeats. In some cases, MIDI files include key signature annotations and lyrics, among other useful things. For a discussion of the presence of these different information sources in files in the Lakh MIDI Dataset, see this tutorial.
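
Some of this information is visible before any notes are decoded: the header chunk of a Standard MIDI File stores the file format, the number of tracks, and the timing resolution from which beat positions are derived. The sketch below reads that header using only Python's standard library. It is a minimal illustration, not part of the dataset's tooling, and read_midi_header is a hypothetical name. It can also serve as a cheap first check for the likely-corrupt files in LMD-full, although a valid header does not guarantee a valid file.

```python
import struct


def read_midi_header(data):
    """Parse the 14-byte header chunk of a Standard MIDI File.

    Returns (format, n_tracks, division). When the top bit of
    division is clear, it is the number of ticks per quarter note.
    Raises ValueError if the header is missing or malformed.
    """
    if len(data) < 14:
        raise ValueError("too short to be a MIDI file")
    # Big-endian: 4-byte chunk ID, 32-bit length, three 16-bit fields.
    chunk_id, length, fmt, n_tracks, division = struct.unpack(
        ">4sIHHH", data[:14]
    )
    if chunk_id != b"MThd" or length < 6:
        raise ValueError("missing or malformed MThd header chunk")
    return fmt, n_tracks, division


# Hand-built header: format 1, 2 tracks, 480 ticks per quarter note.
example_header = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
print(read_midi_header(example_header))
```

Extracting the transcription, beats, key signatures, or lyrics requires parsing the track chunks as well, for which a dedicated MIDI library is more practical than hand-written parsing.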

How reliable are MIDI-derived annotations? This gets at two questions: How reliable are the annotations in MIDI files, and how accurately was the MIDI file aligned to the audio recording? This tutorial addresses these questions in detail.

How reliable is the matching procedure? A MIDI-audio pair was considered a valid match based on the confidence score reported by dynamic time warping-based alignment, which turns out to be extremely reliable. For more discussion and concrete details, see section 4.5 of my thesis. However, the DTW-based alignment scheme is intentionally somewhat invariant to differences in instrumentation. As a result, songs which are harmonically similar may be matched incorrectly. As a concrete example, it's not uncommon for transcriptions of house music to be erroneously matched to dozens of house remixes.
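
For readers unfamiliar with dynamic time warping, the sketch below shows the textbook algorithm on two short numeric sequences, with the total cost normalized by the combined sequence length so that pairs of different lengths are comparable. This is a generic illustration only: it is not the feature representation, cost function, or confidence score used to build the dataset, which are described in the thesis. Both function names are hypothetical.

```python
def dtw_cost(a, b):
    """Total cost of the best DTW alignment between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cost of the best path aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def normalized_dtw_cost(a, b):
    """DTW cost divided by len(a) + len(b); lower means more similar."""
    return dtw_cost(a, b) / (len(a) + len(b))


# The second sequence repeats one value, so warping absorbs the difference.
print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))
print(normalized_dtw_cost([0, 0], [1, 1]))
```

Because the alignment may stretch either sequence, two performances of the same material at different tempos can still align at low cost. The same flexibility is what lets harmonically similar songs match each other.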

How were the matched and aligned datasets assembled? In short, I developed a series of efficient learning-based methods to discard the vast majority of possible matches in the Million Song Dataset. The remaining entries were compared using standard (and computationally expensive) dynamic time warping-based MIDI-to-audio alignment. For a thorough discussion, please see chapters 4-7 of my thesis. And, of course, all of the code used in this project is available here.

Where did these MIDI files come from? They were scraped from publicly available sources on the internet, and then de-duplicated according to their MD5 checksum.
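
Checksum-based de-duplication of this kind can be sketched as follows, assuming the scraped files are available as (original name, bytes) pairs. This is an illustration of the idea, not the script actually used; group_by_md5 and the example file names and contents are made up.

```python
import hashlib
from collections import defaultdict


def group_by_md5(files):
    """Group (original_name, content_bytes) pairs by MD5 checksum.

    Returns {md5_hex: [original names]}; each key is one unique file.
    """
    groups = defaultdict(list)
    for name, content in files:
        groups[hashlib.md5(content).hexdigest()].append(name)
    return dict(groups)


# Made-up inputs: the first two entries are byte-identical copies.
example_files = [
    ("song_a.mid", b"MThd-one"),
    ("copy_of_song_a.mid", b"MThd-one"),
    ("song_b.mid", b"MThd-two"),
]

groups = group_by_md5(example_files)
for md5, names in groups.items():
    # Each unique file would be stored once, under its checksum.
    print(md5 + ".mid", names)
```

This only collapses byte-identical files. Two transcriptions of the same song, or the same transcription saved with different metadata, have different checksums and remain separate entries.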

What is a lakh? A lakh is a unit of measure used in the Indian number system which signifies 100,000 (or, in the Indian convention, 1,00,000). Depending on how you count, the Lakh MIDI Dataset includes about 100,000 MIDI files. The name is a play on the Million Song Dataset, which includes metadata and features for 1,000,000 music recordings.