SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Generating full-length lyric-aligned songs faces core challenges including structural incoherence, vocal-instrumental disharmony, and lyric-audio misalignment. To address these, we propose an interleaved autoregressive-diffusion co-generation paradigm: first, a lightweight autoregressive model generates coarse-grained musical sketches; then, a diffusion model progressively refines them through multi-stage denoising, enabling gradual expansion from short sketches to long sequences and capturing multi-granularity temporal dependencies. Concurrently, we introduce joint lyric-audio representation learning to deeply integrate semantic and acoustic priors. This design achieves superior trade-offs between global structural consistency and local audio fidelity. Comprehensive objective and subjective evaluations demonstrate that our method outperforms all existing open-source and commercial baselines across key metrics—marking the first successful realization of high-fidelity, end-to-end lyric-aligned song generation.

Technology Category

Application Category

📝 Abstract

Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $ extbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo.

Problem

Research questions and friction points this paper is trying to address.

Generating music with coherent structure and harmony

Balancing global coherence and local fidelity in songs

Overcoming incoherent progression and mismatched lyrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved autoregressive sketching and diffusion refinement

Combines diffusion fidelity with language model scalability

Gradual sketch extension and detail refinement

🔎 Similar Papers

No similar papers found.