Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization

📅 2025-08-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multilingual video dubbing, differences in information density between source and target languages cause the translated speech to run longer or shorter than the original, which breaks audio-video synchronization and degrades the viewer experience. To address this, the authors propose Segment Supervised Preference Optimization (SSPO), which treats duration alignment in LLM-based dubbing translation as a preference optimization problem. SSPO combines a segment-wise sampling strategy with a fine-grained loss so that each translated line matches the duration of its source line. The reported experiments show that SSPO achieves superior performance on duration alignment tasks.

📝 Abstract

Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
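To make the preference framing concrete, here is a minimal sketch of how one preference pair could be built for a single dialogue line. It is not taken from the paper: `estimate_duration`, `build_pair`, `PreferencePair`, and the `seconds_per_char` rate are hypothetical names and values, and the character-count duration is only a stand-in for what a real TTS system would produce.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    source: str
    chosen: str    # candidate whose duration is closest to the source line
    rejected: str  # candidate whose duration is furthest from the source line


def estimate_duration(text: str, seconds_per_char: float = 0.06) -> float:
    """Crude, assumed proxy for TTS duration: time grows with non-space characters."""
    return len(text.replace(" ", "")) * seconds_per_char


def build_pair(source: str, source_duration: float, candidates: list[str]) -> PreferencePair:
    """Rank sampled translations of one line by duration deviation and pair the extremes."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate translations to form a pair")
    ranked = sorted(candidates, key=lambda c: abs(estimate_duration(c) - source_duration))
    return PreferencePair(source=source, chosen=ranked[0], rejected=ranked[-1])


candidates = [
    "Hi.",
    "Hello there, my friend.",
    "Hello there, my very dear old friend of mine, how are you?",
]
pair = build_pair("你好，我的朋友。", source_duration=1.2, candidates=candidates)
print(pair.chosen)
print(pair.rejected)
```

In this toy setup the mid-length candidate is preferred because its estimated duration is closest to the 1.2-second source line, and the longest candidate is rejected. A real pipeline would presumably replace the character-count proxy with measured or predicted speech durations.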

Problem

Research questions and friction points this paper is trying to address.

Aligns target speech duration with source in video dubbing

Reduces audio-video sync issues from language density differences

Optimizes segment-wise duration via supervised preference learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment Supervised Preference Optimization method

Segment-wise sampling strategy

Fine-grained loss for duration alignment
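The summary does not spell out the form of the fine-grained loss. As a rough illustration only, the sketch below applies the standard DPO objective to each segment (dialogue line) separately and averages over the segments, instead of computing one loss for the whole output. The function names, the `beta` value, and the per-segment averaging are assumptions, not the paper's definition.

```python
import math


def segment_dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one segment, given summed token log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dialogue_loss(segments: list[tuple[float, float, float, float]], beta: float = 0.1) -> float:
    """Average the per-segment losses (assumed aggregation) over all lines of a dialogue."""
    return sum(segment_dpo_loss(*s, beta=beta) for s in segments) / len(segments)


# Each tuple: (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probabilities.
segments = [(-1.0, -3.0, -2.0, -2.0), (0.0, 0.0, 0.0, 0.0)]
print(dialogue_loss(segments))
```

When the policy and the reference model agree, a segment contributes log 2; the contribution shrinks as the policy assigns relatively more probability to the duration-aligned candidate for that line.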
