Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling

📅 2025-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of large, high-quality piano MIDI datasets for symbolic music modeling. It describes a multi-stage automated pipeline: a language model crawls audio recordings from the internet and scores them based on their metadata, an audio classifier then prunes and segments the recordings, and the retained piano performances are transcribed into MIDI. Using this pipeline, the authors construct Aria-MIDI, a dataset of over one million distinct MIDI files covering roughly 100,000 hours of transcribed audio. The release also includes metadata tags extracted from the content. Aria-MIDI is intended to strengthen the data foundation for symbolic music modeling and to support generative and analytical research in music AI.

📝 Abstract

We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.
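
As a rough sanity check on the scale quoted above, the two headline figures imply an average file length. The sketch below uses only the rounded numbers from the abstract; because the file count is stated as a lower bound ("over one million"), the result should be read as an approximate upper bound on the mean, not a reported statistic.

```python
# Back-of-the-envelope estimate from the abstract's rounded figures.
total_hours = 100_000   # "roughly 100,000 hours" of transcribed audio
num_files = 1_000_000   # "over one million" files, taken here as exactly one million

# Convert hours to minutes, then divide by the number of files.
avg_minutes = total_hours * 60 / num_files
print(f"average length per file: about {avg_minutes:.0f} minutes")
```

Under these assumptions the average comes out to about six minutes per file, which is consistent with a dataset of individual performances or performance segments.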

Problem

Research questions and friction points this paper is trying to address.

Create a large piano MIDI dataset from audio transcriptions

Develop a multi-stage pipeline for crawling, filtering, and transcribing piano audio into MIDI

Analyze the dataset's content and provide extracted metadata tags for symbolic music modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a language model to crawl and score audio recordings based on their metadata

Employs an audio classifier for pruning and segmentation

Produces over one million distinct MIDI files, roughly 100,000 hours of transcribed audio
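
The classifier-based pruning and segmentation step listed above can be illustrated with a minimal sketch. This is not the paper's method, whose details are not given on this page. It assumes a hypothetical classifier that emits one piano-likelihood score per fixed-length audio window, and it uses simple thresholding with a minimum run length; the function name `segment_by_score` and the parameters `threshold` and `min_len` are illustrative choices, not names from the source.

```python
from typing import List, Tuple


def segment_by_score(
    scores: List[float], threshold: float = 0.5, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) window ranges where every score is
    at least `threshold` and the run spans at least `min_len` windows."""
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at window i; keep it only if it is long enough.
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the recording.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


# Hypothetical per-window scores for one recording.
scores = [0.1, 0.9, 0.95, 0.8, 0.2, 0.7, 0.6, 0.9, 0.85, 0.1]
print(segment_by_score(scores))  # [(1, 4), (5, 9)]
```

Windows outside the returned ranges are discarded (pruning), and each returned range would become one candidate segment for transcription. A recording with no sufficiently long run yields an empty list and is dropped entirely.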
