| Million Song Dataset | |
|---|---|
| Name | Million Song Dataset |
| Type | dataset |
| Released | 2011 |
| Field | music information retrieval |
| Size | 1,000,000 tracks |
| Created by | The Echo Nest, LabROSA (Columbia University) |
| Partners | Microsoft Research |
The Million Song Dataset is a large-scale, freely available corpus of audio features and metadata for one million contemporary popular-music tracks, created for research in music information retrieval and audio analysis. A collaboration between LabROSA at Columbia University and The Echo Nest, it provides metadata, precomputed audio features, and structural annotations derived from commercial recordings and digital archives; the accompanying paper was presented at ISMIR 2011. The dataset has served as a benchmark for competitions and academic studies, influencing work in classification, recommendation, and musicology.
The dataset aggregates feature vectors and metadata derived from The Echo Nest's analysis of commercial catalogs, compiled with research groups including LabROSA at Columbia University and Microsoft Research. It omits raw audio but furnishes track-level and time-aligned attributes, distributed as one HDF5 file per track, that can be cross-referenced with identifiers from ISRC-linked catalogs and digital libraries such as MusicBrainz, Discogs, and AllMusic, and with holdings of streaming services. Designed to enable reproducible experiments, the corpus interfaces with tools developed at institutions including the Massachusetts Institute of Technology, the University of California, Berkeley, Queen Mary University of London, McGill University, and the University of Amsterdam.
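Because each track ships as a small HDF5 file of named arrays, per-track access patterns are easy to sketch without the real data. The record below mimics a few MSD field names (`segments_start`, `segments_pitches`) with entirely fabricated values; it is an illustration of time-aligned lookup, not a reader for the actual file format:

```python
# Mock per-track record mimicking (a subset of) MSD field names.
# Every value here is fabricated for illustration only.
track = {
    "track_id": "TRXXXXXX00000000",    # hypothetical placeholder track ID
    "tempo": 120.0,                    # tempo estimate in BPM
    "loudness": -7.3,                  # overall loudness in dB
    "segments_start": [0.0, 0.5, 1.1, 1.8, 2.6],        # segment onsets (s)
    "segments_pitches": [[0.1] * 12 for _ in range(5)], # per-segment chroma
}

def segments_in_window(track, t0, t1):
    """Indices of segments whose onset time falls in [t0, t1)."""
    return [i for i, s in enumerate(track["segments_start"]) if t0 <= s < t1]

print(segments_in_window(track, 0.5, 2.0))  # -> [1, 2, 3]
```

The same windowed-lookup idiom applies to any of the dataset's time-aligned arrays (beats, bars, sections), since they all share the onset-list layout.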
Contents include per-track summaries with high-level descriptors and low-level timbral features produced by The Echo Nest's signal-processing pipeline. Feature sets comprise beat and bar positions, tempo, loudness, per-segment chroma (pitch-class) vectors, and MFCC-like timbre coefficients, comparable to features computed by toolkits such as Marsyas and librosa and validated against corpora such as GTZAN and CAL500. Metadata fields link tracks to artists, albums, and genre tags, cross-referenced with authority files such as MusicBrainz, with performer credits from labels including Universal Music Group and Sony Music Entertainment, and with release data from catalogs overseen by IFPI and RIAA. Structural annotations include segmentation boundaries and repeating patterns used in studies by teams at Princeton University, the University of Illinois Urbana–Champaign, and the University of London.
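As an illustration of how such per-segment features are commonly reduced to track-level descriptors, the sketch below mean-pools 12-dimensional chroma vectors and compares tracks by cosine similarity. Mean pooling plus cosine similarity is a standard textbook recipe, not the dataset's prescribed method, and all chroma values are fabricated:

```python
import math

def track_descriptor(segments_pitches):
    """Mean-pool per-segment 12-d chroma vectors into one track-level vector."""
    n = len(segments_pitches)
    return [sum(seg[k] for seg in segments_pitches) / n for k in range(12)]

def cosine(a, b):
    """Cosine similarity between two track-level descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Fabricated chroma: chroma_x and chroma_y share a pitch-class profile
# (energy concentrated on class 0); chroma_z emphasizes class 7 instead.
chroma_x = [[1.0 if k == 0 else 0.1 for k in range(12)] for _ in range(3)]
chroma_y = [[1.0 if k == 0 else 0.1 for k in range(12)] for _ in range(5)]
chroma_z = [[1.0 if k == 7 else 0.1 for k in range(12)] for _ in range(4)]

print(round(cosine(track_descriptor(chroma_x), track_descriptor(chroma_y)), 3))  # identical profiles -> 1.0
print(round(cosine(track_descriptor(chroma_x), track_descriptor(chroma_z)), 3))  # shifted profile -> lower
```

Pooling discards segment ordering, which is why structure-sensitive tasks such as cover-song identification instead align the full per-segment sequences.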
The project originated from a collaboration between researchers at Columbia University's LabROSA and engineers at The Echo Nest, with industry partners including Microsoft Research and music companies such as Warner Music Group. Key contributors include academics affiliated with the Massachusetts Institute of Technology's Media Lab, doctoral groups at McGill University and Queen Mary University of London, and engineering teams from The Echo Nest (later acquired by Spotify). The dataset paper, describing curation, normalization, and evaluation protocols, was presented at ISMIR in 2011, and follow-up studies appeared at venues such as NIPS from groups at Stanford University, the University of California, San Diego, and ETH Zurich.
Distribution was governed by agreements with rights holders and metadata providers, including Universal Music Group, Sony Music Entertainment, Warner Music Group, and independent labels represented in Discogs. The dataset's license permits research use but forbids redistribution of raw audio, reflecting constraints tied to commercial partners and rights organizations such as IFPI and RIAA. Access mechanisms evolved through hosted mirrors and institutional repositories at Columbia University, mirrors used by ISMIR-affiliated research consortia, and archival efforts at Cornell University and Harvard University aimed at reproducibility.
Researchers employed the corpus for music-recommendation tasks, most visibly the 2012 Million Song Dataset Challenge, evaluated against baselines popularized by engineers at The Echo Nest and practitioners at Spotify and Pandora Radio. Studies in genre classification cited comparative experiments with classifiers developed at Carnegie Mellon University, the University of California, Berkeley, and Princeton University. The dataset also underpinned research in cover-song identification led by teams at Queen Mary University of London and McGill University, as well as tempo- and beat-tracking evaluations by groups at the Massachusetts Institute of Technology and ETH Zurich. It became a standard resource in challenges coordinated alongside ISMIR and informed algorithmic advances presented at venues such as ACM SIGIR and ACL workshops.
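Recommendation baselines of the kind the challenge popularized typically start from (user, song, play count) triplets like those in the dataset's Taste Profile subset. A minimal popularity baseline, with fabricated triplets, might look like this (it is a sketch of the general technique, not any published challenge entry):

```python
from collections import Counter

# Fabricated (user, song, play_count) triplets in the style of the
# Taste Profile subset used for the Million Song Dataset Challenge.
triplets = [
    ("u1", "sA", 3), ("u1", "sB", 1),
    ("u2", "sA", 5), ("u2", "sC", 2),
    ("u3", "sB", 1), ("u3", "sA", 1),
]

def popularity_baseline(triplets, user, k=2):
    """Recommend the k globally most-played songs the user has not yet heard."""
    seen = {s for u, s, _ in triplets if u == user}
    plays = Counter()
    for _, s, c in triplets:
        plays[s] += c
    ranked = [s for s, _ in plays.most_common() if s not in seen]
    return ranked[:k]

print(popularity_baseline(triplets, "u3"))  # u3 has heard sA and sB -> ['sC']
```

Despite its simplicity, global popularity is a surprisingly strong baseline on implicit-feedback data, which is why challenge entries were usually reported relative to it.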
Critiques centered on coverage, representativeness, and licensing. Scholars at the University of Amsterdam, the University of York, and the University of Oxford noted biases toward Western popular music and underrepresentation of noncommercial traditions cataloged in collections such as the Library of Congress and the British Library. The absence of raw audio limited studies requiring waveform access, a constraint for researchers at Google Research and DeepMind, and licensing restrictions hampered reproducibility, as teams from Stanford University and Harvard University noted. Questions of metadata accuracy prompted cross-validation against authority files from MusicBrainz, Discogs, and national discographies compiled by institutions such as the Bibliothèque nationale de France.
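Cross-validation of artist names across catalogs typically begins with aggressive string normalization before any fuzzy matching. A minimal sketch follows; the normalization rules and name variants are illustrative assumptions, not any catalog's actual matching policy:

```python
import re
import unicodedata

def normalize_name(name):
    """Crude normalization for cross-catalog artist matching:
    strip accents, case, punctuation, and a leading 'The'."""
    name = unicodedata.normalize("NFKD", name)                      # split accents off
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    name = name.lower().strip()
    name = re.sub(r"^the\s+", "", name)                             # drop leading article
    name = re.sub(r"[^a-z0-9 ]+", "", name)                         # drop punctuation
    return re.sub(r"\s+", " ", name).strip()                        # collapse whitespace

# Fabricated variants of the kind seen across catalog exports.
print(normalize_name("The Béatles"))   # -> "beatles"
print(normalize_name("  Sigur  Rós ")) # -> "sigur ros"
```

Real matching pipelines layer edit-distance or token-set scoring on top of a normalization pass like this, since normalization alone cannot reconcile misspellings or transliteration differences.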
Category:Music datasets