MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

📅 2025-12-10

📈 Citations: 0

✨ Influential: 0

career value

227K/year

🤖 AI Summary

Music generation faces significant challenges in aligning with human preferences due to the highly subjective nature of audio evaluation. To address this, we propose the first multi-reward collaborative Direct Preference Optimization (DPO) framework for text-to-music generation, built upon a flow-matching architecture. Our method jointly models three complementary reward signals: text alignment, audio fidelity, and semantic consistency. Crucially, we introduce a novel rhythm stability scoring mechanism grounded in semantic self-supervised representations—designed to mitigate ambiguity inherent in subjective assessments. Extensive experiments demonstrate that our approach consistently outperforms strong baselines across both objective metrics and human evaluations: achieving +18.7% improvement in audio quality, +22.3% in text alignment, and +15.9% in musicality preference rate. The source code and audio samples are publicly released.

Technology Category

Application Category

📝 Abstract

A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models - a major class of modern music generative models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) By constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo; Samples are provided in our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.

Problem

Research questions and friction points this paper is trying to address.

Enhances music generation alignment with human preferences

Addresses subjective music evaluation using multiple reward dimensions

Improves rhythmic stability through semantic self-supervised representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses multi-reward DPO for flow-matching music generation

Integrates text alignment, audio quality, and semantic consistency rewards

Proposes semantic self-supervised scoring for rhythmic stability

🔎 Similar Papers

Melody-Guided Music Generation