🤖 AI Summary
This work addresses the challenging problem of end-to-end, long-horizon generation of complete songs from lyrics. To this end, we introduce YuE, an open-source foundational model family built upon the LLaMA2 architecture. Methodologically, we propose: (1) track-decoupled next-token prediction to ensure precise lyric–melody alignment; (2) structured, progressive lyric-conditioned modeling for coherent musical structure; (3) a multi-stage, multi-task pretraining paradigm; and (4) a reengineered in-context learning framework enabling bidirectional generation and cross-style transfer. The model employs token-level audio representations—explicitly decoupling vocal and instrumental tracks—complemented by structured positional encoding and curriculum-based pretraining. Experiments demonstrate that YuE matches or surpasses several proprietary systems in musical quality and vocal expressiveness; supports low-resource language fine-tuning and fine-grained controllability; and achieves state-of-the-art performance on the MARBLE benchmark via representation transfer.
📝 Abstract
We tackle the task of long-form music generation, particularly the challenging **lyrics-to-song** problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe that improves convergence and generalization. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Moreover, fine-tuning YuE enables additional controls and enhanced support for tail languages. Finally, beyond generation, we show that YuE's learned representations perform well on music understanding tasks, matching or exceeding state-of-the-art results on the MARBLE benchmark.

Keywords: lyrics2song, song generation, long-form, foundation model, music generation