🤖 AI Summary
In multilingual video dubbing, languages differ in information density, so translated speech often runs longer or shorter than the source speech, which degrades audio-video synchronization and the viewer experience. To address this, the authors propose Segment Supervised Preference Optimization (SSPO), which treats duration alignment in LLM-based dubbing translation as a preference optimization problem. SSPO combines a segment-wise sampling strategy with a fine-grained loss so that each translated line is supervised on how closely its duration matches the corresponding source line. The method targets the machine translation stage of a dubbing pipeline that also relies on text-to-speech synthesis. The reported experiments show that SSPO achieves superior performance on duration alignment.
📝 Abstract
Video dubbing aims to translate the original speech in visual media programs from a source language into a target language, relying on neural machine translation and text-to-speech technologies. Because information density varies across languages, the duration of the target speech often differs from that of the source speech, causing audio-video synchronization issues that significantly degrade the viewer experience. In this study, we treat duration alignment in LLM-based machine translation for video dubbing as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and a fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
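To make the idea of segment-level preference data concrete, the sketch below shows one way a chosen/rejected pair could be built for a single dubbing line: several candidate translations are ranked by how far their estimated duration falls from the source duration. This is an illustration under stated assumptions, not the paper's procedure. The abstract does not say how candidates are sampled or scored, and the names `PreferencePair`, `estimate_duration`, and `build_segment_pair`, as well as the constant seconds-per-character rate, are hypothetical stand-ins (a real system would presumably use a TTS-based duration estimate).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One segment-level training example: a source line plus a preferred
    and a dispreferred translation (hypothetical structure)."""
    source_line: str
    chosen: str
    rejected: str


def estimate_duration(text: str, seconds_per_char: float = 0.06) -> float:
    """Crude duration proxy: assumes a constant speaking rate per character.
    This is an assumption for illustration, not a real TTS estimate."""
    return len(text) * seconds_per_char


def build_segment_pair(
    source_line: str,
    source_duration: float,
    candidates: List[str],
) -> PreferencePair:
    """Rank candidate translations of one line by absolute duration gap and
    return the closest as 'chosen' and the farthest as 'rejected'."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(
        candidates,
        key=lambda c: abs(estimate_duration(c) - source_duration),
    )
    return PreferencePair(source_line, chosen=ranked[0], rejected=ranked[-1])


# Usage: three candidate translations for a source line lasting 1.0 second.
pair = build_segment_pair(
    source_line="你好吗？",
    source_duration=1.0,
    candidates=[
        "Hello there, how are you doing today?",
        "Hi, how are you?",
        "Hey!",
    ],
)
print(pair.chosen)    # candidate with the smallest duration gap
print(pair.rejected)  # candidate with the largest duration gap
```

Pairs of this shape, built per line rather than per document, are the kind of input a preference optimization objective consumes; translation quality would also need to be controlled when selecting candidates, which this sketch ignores.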