🤖 AI Summary
Current lyric-to-song generation methods suffer from poor audio fidelity, weak musicality, inadequate instruction following, and poor harmony between vocals and accompaniment, largely owing to the complexity of musical structure and the scarcity of high-quality paired data. To address these challenges, we propose an end-to-end controllable generation framework. Our method introduces a novel hybrid tokenization scheme with parallel dual-track token modeling to jointly represent vocals and accompaniment; a DPO-based multi-preference alignment strategy with semi-automatic preference-data construction and optimization; and a decoder-only dual-Transformer architecture paired with a music-specific neural codec and a modular, scalable training strategy. Extensive evaluations demonstrate state-of-the-art performance on objective metrics (STOI, MCD) and subjective listening tests (MOS). Ablation studies confirm the efficacy of each component. Code and audio demos are publicly released.
📝 Abstract
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM models two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between the different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further validate the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/.
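The abstract names DPO post-training as the mechanism behind the multi-preference alignment but does not spell out the objective. As a point of reference, the standard DPO loss scores a preferred/rejected pair by the policy's log-probability margin over a frozen reference model; a minimal sketch (function name and the choice of `beta` are illustrative, not taken from the paper):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a single preference pair.

    pi_logp_w / pi_logp_l   : log-prob of the preferred / rejected sample
                              under the policy being trained
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                              reference model
    beta                    : temperature on the implicit reward margin
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# When policy and reference agree, the margin is 0 and the loss is
# log(2) ~= 0.6931; favoring the preferred sample drives the loss lower.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
```

Minimizing this loss pushes the policy to assign relatively more probability to the preferred song than the reference does, without a separate reward model, which is what makes the semi-automatic preference data directly usable for post-training.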