DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) models largely neglect speech generation, resulting in incomplete audio tracks. This paper introduces the video-to-soundtrack (V2ST) task, which aims to jointly synthesize temporally aligned background audio and speech. Methodologically, the authors build a dual-decoder architecture on a multimodal language model: a multimodal encoder extracts audiovisual features, a cross-modal aligner uses causal and non-causal attention to improve audiovisual synchronization, and two decoding heads produce background audio and speech in an end-to-end joint fashion; a curriculum learning strategy mitigates the scarcity of real paired soundtrack data. Evaluated on the newly constructed DualBench benchmark, the approach achieves significant improvements in soundtrack completeness, audio fidelity, and audiovisual synchronization, establishing new state-of-the-art performance and validating the proposed framework.

📝 Abstract
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
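To make the described architecture concrete, here is a minimal PyTorch sketch of a DualDub-style model as read from the abstract: a multimodal backbone, a cross-modal aligner combining causal self-attention with non-causal cross-attention to video features, and two decoding heads. All module choices, dimensions, vocabulary sizes, and the GRU stand-in for the multimodal language model are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a DualDub-style model, inferred from the abstract.
# Module choices, sizes, and token interfaces are illustrative assumptions only.
import torch
import torch.nn as nn


class CrossModalAligner(nn.Module):
    """Causal self-attention over audio-side states plus non-causal
    cross-attention to video features (assumed combination of the two
    mechanisms named in the abstract)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_h, video_h):
        T = audio_h.size(1)
        # Causal mask keeps audio generation autoregressive.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=audio_h.device), diagonal=1
        )
        x, _ = self.self_attn(audio_h, audio_h, audio_h, attn_mask=causal)
        audio_h = self.norm1(audio_h + x)
        # Non-causal cross-attention: every step may attend to the whole clip.
        x, _ = self.cross_attn(audio_h, video_h, video_h)
        return self.norm2(audio_h + x)


class DualDubSketch(nn.Module):
    """Backbone -> aligner -> two decoding heads (background audio + speech)."""

    def __init__(self, dim: int = 512, audio_vocab: int = 1024, speech_vocab: int = 1024):
        super().__init__()
        self.video_proj = nn.Linear(768, dim)               # hypothetical video-feature size
        self.backbone = nn.GRU(dim, dim, batch_first=True)  # stand-in for the multimodal LM
        self.aligner = CrossModalAligner(dim)
        self.bg_head = nn.Linear(dim, audio_vocab)           # background-audio token logits
        self.speech_head = nn.Linear(dim, speech_vocab)      # speech token logits

    def forward(self, audio_emb, video_feats):
        video_h = self.video_proj(video_feats)
        h, _ = self.backbone(audio_emb)
        h = self.aligner(h, video_h)
        return self.bg_head(h), self.speech_head(h)


if __name__ == "__main__":
    model = DualDubSketch()
    bg_logits, speech_logits = model(torch.randn(2, 50, 512), torch.randn(2, 30, 768))
    print(bg_logits.shape, speech_logits.shape)  # (2, 50, 1024) each
```

The point of the sketch is the split at the output: both heads share the aligned hidden states, which is one plausible way to obtain the "acoustic harmony" between speech and background audio that the abstract emphasizes.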
Problem

Research questions and friction points this paper is trying to address.

Existing V2A models generate background audio but largely overlook speech, leaving video soundtracks incomplete
Real paired data combining video, speech, and background audio is scarce
No benchmark exists for evaluating video-to-soundtrack (V2ST) generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal-language-model framework for joint background audio and speech generation
Cross-modal aligner combining causal and non-causal attention for synchronization and acoustic harmony
Curriculum learning strategy that progressively builds multimodal capability under data scarcity (see the sketch after this list)
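The curriculum bullet only names the strategy; below is a minimal sketch of how such a staged schedule might look, training simpler capabilities on abundant data before the scarce joint task. The stage names, data mixes, and epoch counts are assumptions for illustration, not the schedule reported in the paper.

```python
# Hypothetical staged curriculum for coping with scarce paired soundtrack data.
from typing import Callable

CURRICULUM = [
    # (stage name, data description, epochs) -- assumed, not from the paper
    ("audio_language_modeling", "large audio-only corpora",           10),
    ("video_to_background",     "video + background-audio pairs",      8),
    ("joint_v2st",              "scarce video + speech + background",  4),
]


def run_curriculum(train_stage: Callable[[str, str, int], None]) -> None:
    """Run stages in order so each one builds on the previous weights."""
    for stage, data, epochs in CURRICULUM:
        train_stage(stage, data, epochs)


if __name__ == "__main__":
    run_curriculum(lambda s, d, e: print(f"training {s} on {d} for {e} epochs"))
```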