SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling

📅 2025-06-17

🤖 AI Summary

Problem: Current generative music modeling suffers from a lack of high-quality, open-source, authentic popular music datasets; mainstream resources rely on synthetic audio, re-recordings, or unscreened large-scale audio corpora, resulting in semantically impoverished content, stylistic distortion, and low community adoption. Method: We introduce the first large-scale, open-source dataset specifically designed for generative music modeling, systematically curating over 9 million authentic, commercially released popular songs, including works by globally renowned artists, and supporting diverse tasks including text-to-music generation, singing voice synthesis, melody reconstruction, and cross-modal retrieval. Our approach features copyright-compliant sampling, multi-source metadata alignment, and dual-stage quality filtering based on both audio fidelity and musical structure. Contribution/Results: The dataset enables robust cross-modal (text/audio/score) joint modeling and empirically yields substantial improvements in generation naturalness and stylistic consistency, advancing state-of-the-art performance across multiple benchmarks.

📝 Abstract

We present Sleeping-DISCO 9M, a large-scale pre-training dataset for music and song. To the best of our knowledge, there is no open-source, high-quality dataset representing popular and well-known songs for generative music modeling tasks such as text-to-music generation, music captioning, singing-voice synthesis, melody reconstruction, and cross-modal retrieval. Past contributions focused on isolated and constrained factors: some created synthetic or re-recorded music corpora (e.g. GTSinger, M4Singer), while others assembled arbitrarily large-scale audio datasets (e.g. DISCO-10M and LAIONDISCO-12M). Unfortunately, adoption of these datasets in the generative music community has been limited, as they fail to reflect real-world music and its flavour. Our dataset changes this narrative: it is constructed from actual popular music by world-renowned artists.

Problem

Research questions and friction points this paper is trying to address.

Lack of open-source high-quality music datasets for generative tasks

Existing datasets fail to reflect real-world music and its flavour

Need for a dataset built from actual popular music by renowned artists

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale pre-training dataset for music

Includes popular and well-known songs

Constructed using real-world popular music
