Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Lyrics-to-song generation suffers from pervasive content hallucination, i.e., generated audio that deviates semantically and phonemically from the input lyrics. Method: We propose the first reinforcement learning–based preference optimization framework explicitly designed for lyric alignment. Our approach constructs a hallucination-oriented preference dataset using phoneme error rate (PER) and rule-based filtering, and systematically adapts DPO, PPO, and GRPO to this task for the first time. It integrates phoneme-level alignment modeling with KL regularization to iteratively suppress hallucination without requiring additional human annotations. Contribution/Results: Experiments demonstrate that DPO achieves a significant 7.4% PER reduction; subjective evaluation confirms concurrent improvements in musical quality and stylistic consistency. This work establishes a novel paradigm for enhancing content controllability in generative music models.

📝 Abstract
Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework's transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
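The abstract describes DPO as off-policy training on preference pairs, raising the policy's likelihood margin (relative to a frozen reference model) for the preferred, low-PER sequence. As a rough illustration only (not the paper's implementation; the log-probabilities and `beta` are placeholder inputs), the DPO objective on a single preference pair can be sketched as:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred (low-PER) and
    dispreferred (hallucinated) sequences; ref_logp_* are the same
    quantities under the frozen reference model, which acts as the
    implicit KL anchor.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Minimizing -log(sigmoid(margin)) widens the policy's preference margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; training drives the margin positive, pushing probability mass toward the lyric-faithful sequence.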
Problem

Research questions and friction points this paper is trying to address.

Reducing content hallucination in AI-generated lyric-to-song outputs
Improving alignment between input lyrics and generated song content
Enhancing musical coherence through reinforcement learning optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning framework for hallucination control
Preference optimization strategies (DPO, PPO, and GRPO) adapted to song generation
Phoneme error rate (PER)-based preference dataset for lyric alignment
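The PER signal used to rank generations for the preference dataset is, at its core, a phoneme-level edit distance normalized by reference length. A minimal sketch (the paper's phonemizer and rule-based filtering are not shown; phoneme symbols here are illustrative):

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein edit distance between two phoneme sequences,
    normalized by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / max(m, 1)
```

Given lyric phonemes transcribed from generated audio, a lower score means closer lyric adherence, so pairs of generations can be ranked into preferred/dispreferred examples without human labels.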
Authors

Huaicheng Zhang
Wuhan University, Wuhan, China
Wei Tan
Tencent AI Lab
Guangzheng Li
Tencent AI Lab
Yixuan Zhang
Tencent AI Lab
Hangting Chen
Tencent Hunyuan
Shun Lei
PhD student, Tsinghua University
Chenyu Yang
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Shuai Wang
School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Qijun Huang
Wuhan University, Wuhan, China
Dong Yu
Tencent AI Lab