SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling

📅 2025-06-17
🤖 AI Summary
Problem: Current generative music modeling suffers from a lack of high-quality, open-source, authentic popular music datasets; mainstream resources rely on synthetic audio, re-recordings, or unscreened large-scale audio corpora, resulting in semantically impoverished content, stylistic distortion, and low community adoption.
Method: We introduce the first large-scale, open-source dataset specifically designed for generative music modeling, systematically curating over 9 million authentic, commercially released popular songs, including works by globally renowned artists, and supporting diverse tasks including text-to-music generation, singing voice synthesis, melody reconstruction, and cross-modal retrieval. Our approach features copyright-compliant sampling, multi-source metadata alignment, and dual-stage quality filtering based on both audio fidelity and musical structure.
Contribution/Results: The dataset enables robust cross-modal (text/audio/score) joint modeling and empirically yields substantial improvements in generation naturalness and stylistic consistency, advancing state-of-the-art performance across multiple benchmarks.

📝 Abstract
We present Sleeping-DISCO 9M, a large-scale pre-training dataset for music and song. To the best of our knowledge, there is no open-source, high-quality dataset representing popular and well-known songs for generative music modeling tasks such as text-to-music generation, music captioning, singing-voice synthesis, melody reconstruction, and cross-modal retrieval. Past contributions focused on isolated and constrained factors, with the core aim of creating synthetic or re-recorded music corpora (e.g. GTSinger, M4Singer); arbitrarily large-scale audio datasets (e.g. DISCO-10M and LAIONDISCO-12M) have been another focus for the community. Unfortunately, adoption of these datasets in the generative music community has been limited, as they fail to reflect real-world music and its flavour. Our dataset changes this narrative: it is constructed from actual popular music by world-renowned artists.
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source high-quality music datasets for generative tasks
Existing datasets fail to reflect real-world music and flavor
Need for dataset with actual popular music and renowned artists
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale pre-training dataset for music
Includes popular and well-known songs
Constructed using real-world popular music
Tawsif Ahmed
Sleeping AI
Andrej Radonjic
Wyndl Labs
Gollam Rabby
Postdoctoral researcher
Ai4Science, AI Scientist, Machine Learning