Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization

📅 2025-08-11
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
In multilingual video dubbing, differences in information density between the source and target languages make the translated speech run longer or shorter than the original, which breaks audio-video synchronization and degrades the viewer experience. To address this, the authors propose Segment Supervised Preference Optimization (SSPO), which treats duration alignment in LLM-based dubbing translation as a preference optimization problem. SSPO uses a segment-wise sampling strategy and a fine-grained loss to reduce duration mismatches between source and target lines. The method operates within a dubbing pipeline built on neural machine translation and text-to-speech. The reported experiments show that SSPO achieves superior performance on duration alignment tasks.

📝 Abstract
Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.

Problem

Research questions and friction points this paper is trying to address.

Aligns target speech duration with source in video dubbing
Reduces audio-video sync issues from language density differences
Optimizes segment-wise duration via supervised preference learning
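
To make the duration mismatch concrete, here is a minimal sketch of how a per-line deviation could be measured. The `Line` type, the `relative_gap` and `classify` helpers, and the 10% tolerance are all hypothetical and chosen for illustration; the source does not state which metric or threshold the paper uses.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One dialogue line: source duration and estimated dubbed duration, in seconds."""
    source_sec: float
    target_sec: float


def relative_gap(line: Line) -> float:
    """Signed relative duration deviation; positive means the dub runs long."""
    return (line.target_sec - line.source_sec) / line.source_sec


def classify(line: Line, tolerance: float = 0.1) -> str:
    """Label a line 'long', 'short', or 'ok' under an assumed tolerance."""
    gap = relative_gap(line)
    if gap > tolerance:
        return "long"
    if gap < -tolerance:
        return "short"
    return "ok"
```

For example, a 2.0 s source line dubbed in 2.6 s has a relative gap of 0.3 and would be flagged as `long` under this assumed tolerance.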

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment Supervised Preference Optimization method
Segment-wise sampling strategy
Fine-grained loss for duration alignment
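
The three ideas above can be sketched roughly as follows. This is not the paper's implementation: the source gives no formula, so the sketch assumes that, for each segment (dialogue line), several candidate translations are sampled, the one closest in duration to the source is preferred over the one farthest from it, and the standard DPO objective is applied per segment as a stand-in for the fine-grained loss. The names `pick_preference_pair`, `estimate_sec`, and `dpo_segment_loss`, and the value of `beta`, are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def pick_preference_pair(
    source_sec: float,
    candidates: List[str],
    estimate_sec: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected) for one segment.

    chosen   = candidate whose estimated duration is closest to the source
    rejected = candidate whose estimated duration is farthest from the source
    """
    ranked = sorted(candidates, key=lambda c: abs(estimate_sec(c) - source_sec))
    return ranked[0], ranked[-1]


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_segment_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one segment: -log(sigmoid(margin)).

    The paper's actual loss may differ; this is an assumed stand-in.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    return _softplus(-margin)
```

Under these assumptions, a training step would sum `dpo_segment_loss` over the segments of a dialogue, so that each line contributes its own preference signal instead of one signal for the whole output.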

Authors

Chaoqun Cui, Institute of Automation, Chinese Academy of Sciences (Machine Learning, Natural Language Processing)
Liangbin Huang, Alibaba Digital Media and Entertainment Group
Shijing Wang, Beijing Jiaotong University (deep learning)
Zhe Tong, Alibaba Digital Media and Entertainment Group
Zhaolong Huang, Alibaba Digital Media and Entertainment Group
Xiao Zeng, Alibaba Digital Media and Entertainment Group
Xiaofeng Liu, School of Software Engineering, Huazhong University of Science and Technology