Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech translation methods achieve high translation quality but fail to preserve source speech duration, speaker identity, and speaking rate, which limits their applicability to dubbing for film and television. To address this, we propose the first end-to-end textless speech-to-speech translation framework for cross-lingual dubbing. Our approach leverages a discrete diffusion model to translate speech into acoustic units, augmented with explicit duration control and textless speaking-rate adaptation, enabling temporally aligned, purely speech-based dubbing. Conditional flow-matching synthesis, combined with unit-level speaking-rate guidance, keeps the generated speech faithful to the source in timbre, prosody, and temporal alignment. Experiments demonstrate competitive translation quality alongside improvements in dubbing naturalness and lip-sync accuracy, offering a practical solution for audiovisual localization.
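
To make the summary's pipeline concrete, the sketch below shows one way a masked discrete-diffusion sampler can translate speech into acoustic units while fixing the output length up front, which is the essence of explicit duration control. Everything here is a hedged illustration: the tiny transformer, the confidence-based unmasking schedule, the unit vocabulary size, and the `UnitDenoiser`/`sample` names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

N_UNITS, MASK = 1000, 1000   # assumed acoustic-unit vocabulary size; MASK token id

class UnitDenoiser(nn.Module):
    """Toy stand-in for the discrete diffusion speech-to-unit translator."""
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(N_UNITS + 1, d)               # units + [MASK]
        layer = nn.TransformerEncoderLayer(d, 4, 512, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d, N_UNITS)

    def forward(self, units, src_ctx):
        # src_ctx: encoded source speech, added as a crude conditioning signal
        return self.out(self.body(self.emb(units) + src_ctx))

@torch.no_grad()
def sample(model, src_ctx, target_len, steps=4):
    # Duration control: the output length is fixed to target_len up front,
    # so the translated unit sequence matches the source duration.
    units = torch.full((1, target_len), MASK)
    for s in range(steps):
        logits = model(units, src_ctx)
        conf, pred = logits.softmax(-1).max(-1)
        k = target_len * (s + 1) // steps        # unmask a growing fraction
        keep = conf.topk(k, dim=-1).indices      # most confident positions
        units = torch.full_like(units, MASK)
        units.scatter_(1, keep, pred.gather(1, keep))
    return units

model = UnitDenoiser()
src_ctx = torch.randn(1, 200, 256)   # stand-in source encoding, 200 frames
print(sample(model, src_ctx, target_len=200).shape)   # torch.Size([1, 200])
```

Pinning `target_len` to the source frame count before sampling is what lets the translated units inherit the source duration, rather than letting an autoregressive decoder choose its own output length.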

📝 Abstract
This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Although existing speech translation approaches achieve strong translation quality, they often overlook the transfer of these speech patterns, leading to mismatches with the source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech from the predicted units and the source speaker's identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that match the original speech's duration and speaking pace, while achieving competitive translation performance.
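
The abstract's unit-based speed adaptation implies that speaking rate must be measurable from the discrete units alone, with no transcript. One plausible textless proxy, sketched below under assumed values (a 50 Hz unit frame rate and run-length collapsing of repeated units), counts distinct unit segments per second; the paper's exact rate signal may differ.

```python
import torch

def unit_speaking_rate(units: torch.Tensor, frame_rate_hz: float = 50.0) -> float:
    """Textless speaking-rate proxy: distinct unit segments per second.

    Collapses consecutive repeated units (run-length encoding) so that
    held sounds count once, then normalizes by the clip duration.
    A hypothetical stand-in for the paper's unit-based speed signal.
    """
    runs = (units[1:] != units[:-1]).sum().item() + 1   # number of unit runs
    duration_s = len(units) / frame_rate_hz
    return runs / duration_s

units = torch.tensor([5, 5, 5, 12, 12, 7, 7, 7, 7, 3])   # toy unit sequence
print(unit_speaking_rate(units))   # 4 runs over 0.2 s -> 20.0 runs/second
```

A rate computed this way from the source clip can then condition or guide the translation model so that the generated units imply a similar pace.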
Problem

Research questions and friction points this paper is trying to address.

Preserving source speech duration, speaker identity, and speaking rate during translation
Mismatches between translated and source speech patterns that limit dubbing applications
Achieving textless, time-aligned speech-to-speech translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete diffusion-based speech-to-unit translation model
Conditional flow matching for speech synthesis (see the sketch after this list)
Unit-based speed adaptation mechanism
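
For the flow-matching item above, here is a minimal training-step sketch of the standard conditional flow-matching objective: a network regresses the constant velocity x1 - x0 along a straight noise-to-data path, conditioned on translated units and speaker identity. The toy MLP, feature sizes, and conditioning layout are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow velocity v(x_t, t | cond); a toy stand-in."""
    def __init__(self, dim=80, cond_dim=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

model = VelocityNet()
x1 = torch.randn(8, 80)        # target mel frames (data samples)
cond = torch.randn(8, 96)      # assumed unit embedding + speaker embedding
x0 = torch.randn_like(x1)      # Gaussian noise endpoint
t = torch.rand(8, 1)           # random time along the path
x_t = (1 - t) * x0 + t * x1    # straight interpolation between noise and data
loss = ((model(x_t, t, cond) - (x1 - x0)) ** 2).mean()   # regress velocity
loss.backward()
```

At inference, integrating the learned velocity field from noise to t = 1, conditioned on the translated units and a source speaker embedding, yields speech that preserves the source timbre.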