AI Summary
Generating full-length, lyric-aligned songs faces core challenges including structural incoherence, vocal-instrumental disharmony, and lyric-audio misalignment. To address these, we propose an interleaved autoregressive-diffusion co-generation paradigm: first, a lightweight autoregressive model generates coarse-grained musical sketches; then, a diffusion model progressively refines them through multi-stage denoising, enabling gradual expansion from short sketches to long sequences and capturing temporal dependencies at multiple granularities. In parallel, we introduce joint lyric-audio representation learning to integrate semantic and acoustic priors. This design balances global structural consistency against local audio fidelity. Objective and subjective evaluations show that the method outperforms existing methods on key metrics and performs comparably to state-of-the-art commercial music generation platforms.
Abstract
Generating music with coherent structure and harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo.
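The interleaved paradigm described above can be illustrated with a toy control-flow sketch. This is not the SongBloom implementation: both stages are random or arithmetic stand-ins for learned models, and every name here (`sample_sketch_patch`, `refine_patch`, `generate_song`, `patch_len`, `vocab_size`, `num_steps`) is hypothetical. It only shows the alternation: an autoregressive step extends the sketch by one patch, a refinement step turns that patch into acoustic latents, and those latents feed back as context for the next patch.

```python
import random


def sample_sketch_patch(acoustic_history, patch_len, vocab_size, rng):
    """Toy autoregressive stage: emit one patch of coarse sketch tokens.

    Stand-in for a learned model (assumption). It conditions only on the
    last acoustic latent of the previous patch and on its own earlier tokens.
    """
    prev = int(round(acoustic_history[-1])) if acoustic_history else 0
    patch = []
    for _ in range(patch_len):
        prev = (prev + rng.randrange(3)) % vocab_size
        patch.append(prev)
    return patch


def refine_patch(sketch_patch, num_steps, rng):
    """Toy refinement stage: start from noise and move toward the sketch.

    Stand-in for a diffusion denoiser (assumption). Each step interpolates
    toward a sketch-derived target, coarse to fine; the last step lands on it.
    """
    target = [float(token) for token in sketch_patch]
    latent = [rng.gauss(0.0, 1.0) for _ in sketch_patch]
    for step in range(num_steps):
        alpha = 1.0 / (num_steps - step)  # reaches 1.0 on the final step
        latent = [(1.0 - alpha) * x + alpha * t for x, t in zip(latent, target)]
    return latent


def generate_song(num_patches, patch_len=4, vocab_size=16, num_steps=5, seed=0):
    """Interleave sketching and refinement, one patch at a time."""
    rng = random.Random(seed)
    sketch, acoustic = [], []
    for _ in range(num_patches):
        patch = sample_sketch_patch(acoustic, patch_len, vocab_size, rng)
        latents = refine_patch(patch, num_steps, rng)
        sketch.extend(patch)      # the sketch grows from short to long
        acoustic.extend(latents)  # refined audio becomes context for the next patch
    return sketch, acoustic


if __name__ == "__main__":
    sketch, acoustic = generate_song(num_patches=3)
    print(len(sketch), len(acoustic))
```

In a real system the sketch tokens would also be conditioned on the lyrics and the refinement stage would be a trained denoising network; the loop structure is the only part this sketch is meant to convey.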