SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

๐Ÿ“… 2025-06-09
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Generating full-length lyric-aligned songs faces core challenges including structural incoherence, vocal-instrumental disharmony, and lyric-audio misalignment. To address these, we propose an interleaved autoregressive-diffusion co-generation paradigm: first, a lightweight autoregressive model generates coarse-grained musical sketches; then, a diffusion model progressively refines them through multi-stage denoising, enabling gradual expansion from short sketches to long sequences and capturing multi-granularity temporal dependencies. Concurrently, we introduce joint lyric-audio representation learning to deeply integrate semantic and acoustic priors. This design achieves superior trade-offs between global structural consistency and local audio fidelity. Comprehensive objective and subjective evaluations demonstrate that our method outperforms all existing open-source and commercial baselines across key metricsโ€”marking the first successful realization of high-fidelity, end-to-end lyric-aligned song generation.

Technology Category

Application Category

๐Ÿ“ Abstract
Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $ extbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo.
Problem

Research questions and friction points this paper is trying to address.

Generating music with coherent structure and harmony
Balancing global coherence and local fidelity in songs
Overcoming incoherent progression and mismatched lyrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved autoregressive sketching and diffusion refinement
Combines diffusion fidelity with language model scalability
Gradual sketch extension and detail refinement
๐Ÿ”Ž Similar Papers
No similar papers found.
C
Chenyu Yang
The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institution of Big Data
S
Shuai Wang
Nanjing University; Shenzhen Research Institution of Big Data
Hangting Chen
Hangting Chen
Tencent Hunyuan
signal processingspeech separationDCASE
W
Wei Tan
Tencent AI Lab
Jianwei Yu
Jianwei Yu
Tencent AI lab
ASR
Haizhou Li
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; NUS, Singapore
Automatic Speech RecognitionSpeaker RecognitionLanguage RecognitionVoice ConversionMachine Translation