V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation

📅 2026-03-11
🤖 AI Summary
This work addresses the difficulty existing video-to-audio generation models have in aligning with human preferences along three dimensions: semantic consistency, temporal alignment, and perceptual quality. The authors propose a Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation, built around AudioScore, a scoring system that enables large-scale automated generation of preference pairs. The approach also incorporates a curriculum learning strategy adapted to flow-based generative models. On the VGGSound dataset, Frieren and MMAudio aligned with this method outperform both their DDPO-optimized counterparts and the pretrained baselines, and the DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics.

📝 Abstract
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio (V2A) generation models, with key adaptations that align generated audio with human preferences. Our approach incorporates three core innovations: (1) AudioScore, a comprehensive human-preference-aligned scoring system for assessing the semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference-pair data for DPO optimization; (3) a curriculum-learning-empowered DPO optimization strategy specifically tailored for flow-based generative models. Experiments on the VGGSound benchmark dataset demonstrate that Frieren and MMAudio aligned with human preferences using V2A-DPO outperform both their counterparts optimized with Denoising Diffusion Policy Optimization (DDPO) and the pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
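To make the idea of score-driven preference-pair generation concrete, the sketch below shows one common way such a pipeline could be organized. It is not the authors' implementation: the abstract does not specify how AudioScore is computed or how pairs are selected. The `Candidate` type, the scalar `score` field, the best-versus-worst selection rule, and the `min_margin` threshold are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    audio_id: str  # hypothetical identifier for one generated audio clip
    score: float   # hypothetical scalar preference score (higher is better)


def build_preference_pair(
    candidates: List[Candidate], min_margin: float = 0.1
) -> Optional[Tuple[str, str]]:
    """Return (preferred_id, rejected_id) for one video, or None.

    Assumed rule: take the highest- and lowest-scoring candidates and keep
    the pair only if their score gap is at least `min_margin`.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    winner, loser = ranked[0], ranked[-1]
    if winner.score - loser.score < min_margin:
        return None  # gap too small to count as a clear preference
    return winner.audio_id, loser.audio_id


def build_dataset(
    scored: Dict[str, List[Candidate]], min_margin: float = 0.1
) -> Dict[str, Tuple[str, str]]:
    """Map each video id to a preference pair, skipping ambiguous videos."""
    pairs = {}
    for video_id, candidates in scored.items():
        pair = build_preference_pair(candidates, min_margin)
        if pair is not None:
            pairs[video_id] = pair
    return pairs


if __name__ == "__main__":
    scored = {
        "video_1": [Candidate("a", 0.9), Candidate("b", 0.2), Candidate("c", 0.5)],
        "video_2": [Candidate("d", 0.50), Candidate("e", 0.45)],
    }
    print(build_dataset(scored))
```

In this sketch, a video whose candidates all score within `min_margin` of each other contributes no pair, on the assumption that near-ties carry little preference signal. A real pipeline would need to combine the three AudioScore dimensions into whatever ranking criterion the paper actually uses.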
Problem

Research questions and friction points this paper is trying to address.

video-to-audio generation
human preference alignment
preference optimization
audio quality
temporal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Preference Optimization
Video-to-Audio Generation
AudioScore
Curriculum Learning
Flow-based Generative Models
Authors
Nolan Chan
The Chinese University of Hong Kong, Hong Kong SAR, China
Timmy Gang
National Research Council Canada, Canada
Yongqian Wang
The University of Warwick, UK
Yuzhe Liang
Shanghai Jiao Tong University
Deep learning, Multimodal Learning
Dingdong Wang
The Chinese University of Hong Kong, Hong Kong SAR, China