SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video-audio joint generation methods perform well at the semantic level but still struggle with fine-grained temporal synchronization, that is, the precise alignment of audio events with their visual triggers. This work proposes SyncDPO, a framework that applies Direct Preference Optimization (DPO) to preference pairs whose negative samples are built by rule-driven temporal perturbation, requiring no additional sampling or annotation. A curriculum learning strategy then progressively refines the model's ability to discriminate temporal misalignments, moving from coarse to fine granularity. Evaluated on four diverse benchmarks, SyncDPO significantly outperforms prior approaches, showing better temporal alignment and stronger out-of-distribution generalization in both objective metrics and subjective evaluations.
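For readers unfamiliar with DPO, the snippet below sketches the standard pairwise DPO loss for a single preference pair. It is a minimal illustration and not SyncDPO's training code: the source does not give the exact objective, and for a diffusion- or flow-based generator the log-likelihood terms would typically be replaced by a surrogate. The function name `dpo_loss`, its arguments, and the default `beta` are assumptions made for this example.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : the same quantities under a frozen reference model
    beta       : strength of the implicit KL constraint (illustrative default)
    """
    # Implicit reward margin between the aligned and the misaligned sample.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this form the loss equals log 2 when the trained model and the reference agree, and it falls as the trained model raises the likelihood of the aligned sample relative to the temporally distorted one.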
📝 Abstract
Recent advances in video-audio joint generation have achieved remarkable success in semantic correspondence. However, precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains challenging. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative: it introduces explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of video-audio (V-A) joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly, rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.
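The rule-based negative construction and the coarse-to-fine curriculum described in the abstract can be sketched as follows. This is a hypothetical illustration under stated assumptions, not the authors' implementation: the abstract does not list the actual perturbation rules, so a plain temporal shift of the audio track stands in for them, and every name (`shift_audio`, `max_offset_for_step`, `make_preference_pair`) and default value (16000 and 800 samples, roughly 1 s and 50 ms at an assumed 16 kHz) is invented for the example.

```python
import random

def shift_audio(audio, offset):
    """Delay (offset > 0) or advance (offset < 0) the audio by `offset` samples.

    The vacated region is zero-padded so the length is unchanged.
    """
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]

def max_offset_for_step(step, total_steps, coarse, fine):
    """Linearly shrink the largest allowed shift from `coarse` to `fine` samples."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return round(coarse + frac * (fine - coarse))

def make_preference_pair(video, audio, step, total_steps,
                         coarse=16000, fine=800, rng=random):
    """Build one (chosen, rejected) pair on the fly, with no sampling or ranking.

    The ground-truth audio is the preferred sample; a shifted copy is the
    rejected one. Later steps use smaller shifts, i.e. harder negatives.
    """
    limit = max_offset_for_step(step, total_steps, coarse, fine)
    magnitude = rng.randint(max(1, limit // 2), limit)
    offset = magnitude if rng.random() < 0.5 else -magnitude
    return {
        "video": video,
        "chosen": audio,
        "rejected": shift_audio(audio, offset),
        "offset": offset,
    }
```

Because the negative is derived from the ground-truth pair by a deterministic rule, no extra model sampling or human ranking is needed, which is the efficiency argument the abstract makes. Shrinking the shift over training is one simple way to move from coarse misalignment to subtle inconsistencies.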
Problem

Research questions and friction points this paper is trying to address.

temporal synchronization
video-audio joint generation
fine-grained alignment
audio-visual alignment
temporal misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Synchronization
Direct Preference Optimization
Negative Sample Construction
Curriculum Learning
Video-Audio Joint Generation