SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video–audio joint generation methods perform well at the semantic level but still struggle with fine-grained temporal synchronization—specifically, the precise alignment of audio events with their visual triggers. This work proposes SyncDPO, a novel framework that integrates Direct Preference Optimization (DPO) with rule-driven temporal perturbation to construct negative samples without additional sampling or annotations, combined with a curriculum learning strategy that progressively refines the model’s ability to discriminate temporal misalignments from coarse to fine granularity. Evaluated on four diverse benchmarks, SyncDPO significantly outperforms prior approaches, demonstrating superior temporal alignment and stronger out-of-distribution generalization in both objective metrics and subjective evaluations.
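As a rough illustration of the idea summarized above, the sketch below builds a preference pair on the fly by temporally shifting the audio track, and shrinks the shift over training so that negatives move from coarse to subtle misalignment. The summary does not specify the paper's actual perturbation rules or schedule, so everything here is an assumption made for illustration: the circular shift, the linear schedule, the frame-list representation, and the names `shift_audio`, `max_shift_for_step`, and `make_preference_pair` are all hypothetical.

```python
import random

def shift_audio(audio, shift):
    """Circularly shift a list of audio frames; positive shift delays the audio."""
    n = len(audio)
    if n == 0:
        return list(audio)
    k = shift % n
    return audio[-k:] + audio[:-k] if k else list(audio)

def max_shift_for_step(step, total_steps, coarse=16, fine=1):
    """Hypothetical linear curriculum: large offsets early, small offsets late."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return max(fine, round(coarse - frac * (coarse - fine)))

def make_preference_pair(video, audio, step, total_steps, rng=random):
    """Return (chosen, rejected); rejected reuses the video with misaligned audio.

    No extra sampling or annotation is needed: the negative is derived from
    the ground-truth pair by a rule (here, a temporal shift).
    """
    if len(audio) < 2:
        raise ValueError("need at least two audio frames to misalign")
    # Clamp so the shift is never a multiple of the sequence length.
    magnitude = min(max_shift_for_step(step, total_steps), len(audio) - 1)
    offset = rng.choice((-1, 1)) * magnitude  # advance or delay the audio
    chosen = (video, audio)
    rejected = (video, shift_audio(audio, offset))
    return chosen, rejected
```

Calling `make_preference_pair(video, audio, step, total_steps)` once per training example yields an aligned pair as the preferred sample and its shifted counterpart as the rejected one. With the default parameters the shift shrinks from 16 frames at the first step to 1 frame at the last.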

📝 Abstract

Recent advancements in video-audio (V-A) joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative by introducing explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.
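To make the contrast with a plain regression loss concrete, here is a minimal sketch of the standard DPO objective for a single preference pair: the negative log-sigmoid of the scaled difference between the policy's and a frozen reference model's log-likelihood margins. This is the generic DPO formulation, not necessarily the exact loss used by SyncDPO, which the abstract does not spell out. The function name `dpo_loss`, the scalar log-likelihood inputs, and the default `beta=0.1` are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : strength of the implicit constraint toward the reference
    """
    # How much more the policy favors the aligned sample than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls only as the model assigns relatively higher likelihood to the aligned pair than to its temporally distorted counterpart, which gives the misaligned sample an explicit penalty that a reconstruction loss on the aligned sample alone does not provide.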

Problem

Research questions and friction points this paper is trying to address.

temporal synchronization

video-audio joint generation

fine-grained alignment

audio-visual alignment

temporal misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Synchronization

Direct Preference Optimization

Negative Sample Construction