🤖 AI Summary
This work addresses the challenging problem of end-to-end, long-horizon generation of complete songs from lyrics. To this end, we introduce YuE, an open-source foundational model family built upon the LLaMA2 architecture. Methodologically, we propose: (1) track-decoupled next-token prediction to ensure precise lyric–melody alignment; (2) structured, progressive lyric-conditioned modeling for coherent musical structure; (3) a multi-stage, multi-task pretraining paradigm; and (4) a reengineered in-context learning framework enabling bidirectional generation and cross-style transfer. The model employs token-level audio representations—explicitly decoupling vocal and instrumental tracks—complemented by structured positional encoding and curriculum-based pretraining. Experiments demonstrate that YuE matches or surpasses several proprietary systems in musical quality and vocal expressiveness; supports low-resource language fine-tuning and fine-grained controllability; and achieves state-of-the-art performance on the MARBLE benchmark via representation transfer.
📝 Abstract
We tackle the task of long-form music generation, particularly the challenging **lyrics-to-song** problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe that improves convergence and generalization. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Moreover, fine-tuning YuE enables additional controls and enhanced support for tail languages. Finally, beyond generation, we show that YuE's learned representations perform well on music understanding tasks, matching or exceeding state-of-the-art results on the MARBLE benchmark.

Keywords: lyrics2song, song generation, long-form, foundation model, music generation