🤖 AI Summary
To address the lack of structured annotations in song data and the high cost of manual preprocessing, this paper proposes SongPrep, an automated preprocessing pipeline for song data, together with SongPrepE2E, an end-to-end structured lyrics recognition model. The SongPrep pipeline streamlines source separation, structure analysis, and lyric recognition to produce structured data ready for training song generation models. SongPrepE2E, built on pretrained language models, dispenses with source separation entirely: it analyzes an entire song's structure (e.g., verse/chorus) and transcribes its lyrics with precise timestamps, leveraging full-song context together with pretrained semantic knowledge. On the proposed SSLD-200 dataset, SongPrepE2E achieves low Diarization Error Rate (DER) and Word Error Rate (WER). In downstream tasks, song generation models trained on its outputs produce songs that closely resemble human-created ones, establishing a high-quality, low-cost data foundation for generative music modeling.
📝 Abstract
Artificial Intelligence Generated Content (AIGC) is currently a popular research area. Among its various branches, song generation has attracted growing interest. Despite the abundance of available songs, effective data preparation remains a significant challenge. Converting these songs into training-ready datasets typically requires extensive manual labeling, which is both time-consuming and costly. To address this issue, we propose SongPrep, an automated preprocessing pipeline designed specifically for song data. This framework streamlines key processes such as source separation, structure analysis, and lyric recognition, producing structured data that can be directly used to train song generation models. Furthermore, we introduce SongPrepE2E, an end-to-end structured lyrics recognition model based on pretrained language models. Without the need for additional source separation, SongPrepE2E is able to analyze the structure and lyrics of entire songs and provide precise timestamps. By leveraging context from the whole song alongside pretrained semantic knowledge, SongPrepE2E achieves low Diarization Error Rate (DER) and Word Error Rate (WER) on the proposed SSLD-200 dataset. Downstream tasks demonstrate that training song generation models with the data output by SongPrepE2E enables the generated songs to closely resemble those produced by humans.
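The three stages the abstract names, source separation, structure analysis, and lyric recognition, compose naturally into a pipeline that emits labeled, timestamped segments. The sketch below illustrates that shape only; every function name, type, and return value here is a hypothetical placeholder, not the paper's actual interface:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Segment:
    """One structured annotation: a section label, its time span, and its lyrics."""
    label: str    # structural tag, e.g. "verse" or "chorus"
    start: float  # start time in seconds
    end: float    # end time in seconds
    lyrics: str   # transcribed lyric text for this span

def separate_vocals(audio: Any) -> Any:
    # Placeholder for a source-separation stage (in practice, a learned model).
    return audio

def analyze_structure(vocals: Any) -> List[Tuple[str, float, float]]:
    # Placeholder: return (label, start, end) spans covering the track.
    return [("verse", 0.0, 15.0), ("chorus", 15.0, 30.0)]

def recognize_lyrics(vocals: Any, span: Tuple[float, float]) -> str:
    # Placeholder for timestamped lyric recognition on one span.
    return ""

def songprep_pipeline(audio: Any) -> List[Segment]:
    """Run the three stages in order and return structured, timestamped segments."""
    vocals = separate_vocals(audio)
    return [
        Segment(label, start, end, recognize_lyrics(vocals, (start, end)))
        for label, start, end in analyze_structure(vocals)
    ]
```

The end-to-end model (SongPrepE2E) would collapse the first stage away, mapping raw full-song audio directly to the same `Segment`-style output, which is what makes its results directly usable as training data.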