Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization

📅 2025-08-11

🤖 AI Summary
In video dubbing, languages differ in information density, so translated speech often runs longer or shorter than the source speech. The mismatch breaks audio-video synchronization and degrades the viewer experience. To address this, the paper proposes Segment Supervised Preference Optimization (SSPO), which treats duration alignment in LLM-based dubbing translation as a preference optimization problem. SSPO combines a segment-wise sampling strategy with a fine-grained loss so that each target line better matches the duration of its source line. The dubbing pipeline it targets relies on neural machine translation and text-to-speech synthesis. The authors report that SSPO outperforms the compared methods on duration alignment tasks.

📝 Abstract
Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
Problem

Research questions and friction points this paper is trying to address.

Aligns target speech duration with source in video dubbing
Reduces audio-video sync issues from language density differences
Optimizes segment-wise duration via supervised preference learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment Supervised Preference Optimization method
Segment-wise sampling strategy
Fine-grained loss for duration alignment
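
To make the segment-wise idea concrete, the sketch below shows one way preference pairs could be built per dubbing line: sample several candidate translations for each segment, then prefer the candidate whose estimated speech duration is closest to the source line. This is an illustration under stated assumptions, not the authors' implementation. The names `estimate_duration`, `duration_deviation`, `build_segment_pairs`, and `PreferencePair`, as well as the fixed per-character speaking rate, are hypothetical; a real system would more likely measure duration from synthesized speech.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PreferencePair:
    segment_index: int  # which dubbing line the pair belongs to
    chosen: str         # candidate closest to the source duration
    rejected: str       # candidate farthest from the source duration


def estimate_duration(text: str, seconds_per_char: float = 0.06) -> float:
    """Hypothetical duration proxy: non-space characters times a fixed rate.

    Assumption for illustration only; it stands in for a TTS-based measurement.
    """
    return sum(1 for ch in text if not ch.isspace()) * seconds_per_char


def duration_deviation(candidate: str, source_seconds: float) -> float:
    """Absolute gap between the estimated target and the source duration."""
    return abs(estimate_duration(candidate) - source_seconds)


def build_segment_pairs(
    source_durations: Sequence[float],
    candidates_per_segment: Sequence[Sequence[str]],
) -> List[PreferencePair]:
    """Build at most one (chosen, rejected) pair per segment."""
    pairs: List[PreferencePair] = []
    for index, (source_seconds, candidates) in enumerate(
        zip(source_durations, candidates_per_segment)
    ):
        if len(candidates) < 2:
            continue  # a preference needs at least two candidates
        ranked = sorted(
            candidates, key=lambda c: duration_deviation(c, source_seconds)
        )
        best, worst = ranked[0], ranked[-1]
        if duration_deviation(best, source_seconds) == duration_deviation(
            worst, source_seconds
        ):
            continue  # all candidates tie, so there is no preference signal
        pairs.append(PreferencePair(index, best, worst))
    return pairs


if __name__ == "__main__":
    pairs = build_segment_pairs(
        source_durations=[1.2, 3.0],
        candidates_per_segment=[
            ["Hello there", "Hello there, my dear old friend"],
            ["Yes"],  # only one candidate, so this segment is skipped
        ],
    )
    for pair in pairs:
        print(pair)
```

Pairs like these could then feed a preference optimization objective applied per segment rather than per full dialogue, which is the granularity the method's name suggests. The exact sampling procedure and loss are defined in the paper and are not reproduced here.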