YuE: Scaling Open Foundation Models for Long-Form Music Generation

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of end-to-end, long-horizon generation of complete songs from lyrics. To this end, we introduce YuE, an open-source foundational model family built upon the LLaMA2 architecture. Methodologically, we propose: (1) track-decoupled next-token prediction to ensure precise lyric–melody alignment; (2) structured, progressive lyric-conditioned modeling for coherent musical structure; (3) a multi-stage, multi-task pretraining paradigm; and (4) a reengineered in-context learning framework enabling bidirectional generation and cross-style transfer. The model employs token-level audio representations—explicitly decoupling vocal and instrumental tracks—complemented by structured positional encoding and curriculum-based pretraining. Experiments demonstrate that YuE matches or surpasses several proprietary systems in musical quality and vocal expressiveness; supports low-resource language fine-tuning and fine-grained controllability; and achieves state-of-the-art performance on the MARBLE benchmark via representation transfer.

📝 Abstract
We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe for convergence and generalization. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Moreover, fine-tuning YuE enables additional controls and enhanced support for tail languages. Finally, beyond generation, we show that YuE's learned representations perform well on music understanding tasks, matching or exceeding state-of-the-art methods on the MARBLE benchmark.
Keywords: lyrics2song, song generation, long-form, foundation model, music generation
Problem

Research questions and friction points this paper is trying to address.

Generating complete, long-form songs from lyrics while maintaining lyrical alignment.
Enabling versatile style transfer and bidirectional music generation.
Leveraging generative representations for music understanding tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Track-decoupled next-token prediction for dense signals
Structural progressive conditioning for lyrical alignment
Multitask, multiphase pre-training for generalization
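The track-decoupling idea can be illustrated with a minimal sketch (hypothetical token values and helper names; the actual YuE tokenizer and interleaving scheme may differ): instead of modeling a dense audio mixture, vocal and instrumental codec tokens are treated as separate tracks and combined into a single autoregressive stream, e.g. by interleaving them frame by frame.

```python
# Hypothetical sketch of track-decoupled next-token prediction:
# vocal and instrumental codec tokens are interleaved per frame,
# so one language model predicts both tracks in a single stream.

def interleave_tracks(vocal_tokens, inst_tokens):
    """Interleave per-frame vocal and instrumental codec tokens."""
    assert len(vocal_tokens) == len(inst_tokens)
    sequence = []
    for v, i in zip(vocal_tokens, inst_tokens):
        sequence.append(v)  # vocal token for this frame
        sequence.append(i)  # instrumental token for this frame
    return sequence

def deinterleave(sequence):
    """Recover the two tracks from an interleaved sequence."""
    return sequence[0::2], sequence[1::2]

vocal = [101, 102, 103]
inst = [201, 202, 203]
seq = interleave_tracks(vocal, inst)
print(seq)  # [101, 201, 102, 202, 103, 203]
```

The model then performs ordinary next-token prediction over `seq`; at decode time, `deinterleave` splits the generated stream back into a vocal track and an instrumental track.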
👥 Authors
Ruibin Yuan (HKUST): Artificial Intelligence · Music Generation · Music Information Retrieval · Computer Music
Hanfeng Lin
Shuyue Guo
Ge Zhang
Jiahao Pan (Hong Kong University of Science and Technology): Speech Processing · Speech Enhancement · Music Generation
Yongyi Zang (Smule, Inc.): Computer Audition · Speech Processing · Music Information Retrieval · Music Composition
Haohe Liu (Research Scientist at Meta AI): Audio Generation · Audio Classification · Speech Quality Enhancement · Music Source Separation
Yiming Liang (Institute of Automation, Chinese Academy of Sciences (CASIA), M-A-P): LLM
Wenye Ma
Xingjian Du
Xinrun Du (Multimodal Art Projection Research Community, 01.ai): LLM
Zhen Ye
Tianyu Zheng (M-A-P & TikTok researcher): LLM
Yinghao Ma (PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London): Music Information Retrieval · Large Language Models · Multimodal Learning · Audio Signal Processing
Minghao Liu
Zeyue Tian (Hong Kong University of Science and Technology): Music Generation · Generative AI · Multi-Modal Learning
Ziya Zhou (The Hong Kong University of Science and Technology): Music Technology · Natural Language Processing
Liumeng Xue (Hong Kong University of Science and Technology): Audio, Speech and Language Processing · Speech Generation
Xingwei Qu
Yizhi Li (University of Manchester, M-A-P): LLM · Reasoning · Post-training · Computational Music
Shangda Wu (Tencent): Symbolic Music Generation · Music Information Retrieval · Multimodal Learning
Tianhao Shen
Ziyang Ma
Junlin Zhan
Chunhui Wang
Yatian Wang
Xiao-Qian Chi
Xinyue Zhang (Southwest University of Science and Technology): Machine Learning · Multi-view Clustering
Zhenzhu Yang
Xiangzhou Wang
Shansong Liu (TeleAI): Music AI · TTS · LLM · Multi-modal LLM · Audio Codec
Ling Mei
Peng Li
Junjie Wang
Jian-Xiu Yu
Guojian Pang
Xu Li
Zihao Wang
Xiaohuan Zhou (ByteDance)
Lijun Yu (Google DeepMind): Video Generation · Multimodal Foundation Model
Emmanouil Benetos (Queen Mary University of London): Machine Listening · Audio Signal Processing · Music Information Retrieval · Machine Learning
Yong Chen
Cheng-Ju Lin
Xie Chen
Gus Xia
Zhaoxiang Zhang (Institute of Automation, Chinese Academy of Sciences): Computer Vision · Pattern Recognition · Biologically-inspired Learning
Chao Zhang
Wenhu Chen (Assistant Professor at University of Waterloo): Natural Language Processing · Artificial Intelligence · Deep Learning
Xipeng Qiu
R. Dannenberg
Jia-Hua Liu
Jian Yang
Wenhao Huang
Wei Xue
Xu Tan
Yike Guo