Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Lyrics-to-song generation suffers from pervasive content hallucination, i.e., generated audio that deviates semantically and phonemically from the input lyrics. Method: We propose the first reinforcement learning–based preference optimization framework explicitly designed for lyric alignment. Our approach constructs a hallucination-oriented preference dataset using phoneme error rate (PER) and rule-based filtering, and systematically adapts DPO, PPO, and GRPO to this task for the first time. It integrates phoneme-level alignment modeling with KL regularization to iteratively suppress hallucination without requiring additional human annotations. Contribution/Results: Experiments demonstrate that DPO achieves a significant 7.4% PER reduction; subjective evaluation confirms concurrent improvements in musical quality and stylistic consistency. This work establishes a novel paradigm for enhancing content controllability in generative music models.

📝 Abstract
Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework's transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
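The abstract describes DPO as off-policy training on preference pairs, raising the policy's likelihood margin (relative to a frozen reference model) for the preferred, low-PER sequence. As a rough illustration only (not the paper's implementation; the log-probabilities and `beta` are placeholder inputs), the DPO objective on a single preference pair can be sketched as:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred (low-PER) and
    dispreferred (hallucinated) sequences; ref_logp_* are the same
    quantities under the frozen reference model, which acts as the
    implicit KL anchor.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Minimizing -log(sigmoid(margin)) widens the policy's preference margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; training drives the margin positive, pushing probability mass toward the lyric-faithful sequence.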
Problem

Research questions and friction points this paper is trying to address.

Reducing content hallucination in AI-generated lyric-to-song outputs
Improving alignment between input lyrics and generated song content
Enhancing musical coherence through reinforcement learning optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning framework for hallucination control
Preference optimization strategies (DPO, PPO, and GRPO) adapted to song generation
Phoneme error rate (PER)-based preference dataset for lyric alignment
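The PER signal used to rank generations for the preference dataset is, at its core, a phoneme-level edit distance normalized by reference length. A minimal sketch (the paper's phonemizer and rule-based filtering are not shown; phoneme symbols here are illustrative):

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein edit distance between two phoneme sequences,
    normalized by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / max(m, 1)
```

Given lyric phonemes transcribed from generated audio, a lower score means closer lyric adherence, so pairs of generations can be ranked into preferred/dispreferred examples without human labels.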
Authors

Huaicheng Zhang
Wuhan University, Wuhan, China
Wei Tan
Tencent AI Lab
Guangzheng Li
Tencent AI Lab
Yixuan Zhang
Tencent AI Lab
Hangting Chen
Tencent Hunyuan
Shun Lei
PhD student, Tsinghua University
Chenyu Yang
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Shuai Wang
School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Qijun Huang
Wuhan University, Wuhan, China
Dong Yu
Tencent AI Lab