DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) models largely neglect speech generation, resulting in incomplete audio tracks. This paper introduces the video-to-soundtrack (V2ST) task, which aims to jointly synthesize temporally aligned background audio and speech. Methodologically, the authors build a dual-decoder architecture on a multimodal language model: a multimodal encoder extracts audiovisual features, a cross-modal aligner uses causal and non-causal attention to improve audiovisual synchronization, and two decoding heads produce background audio and speech in an end-to-end joint fashion; a curriculum learning strategy mitigates the scarcity of real paired soundtrack data. Evaluated on the newly constructed DualBench benchmark, the approach achieves significant improvements in soundtrack completeness, audio fidelity, and audiovisual synchronization, establishing new state-of-the-art performance and validating the proposed framework.

📝 Abstract
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
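To make the described architecture concrete, here is a minimal PyTorch sketch of a DualDub-style model as read from the abstract: a multimodal backbone, a cross-modal aligner combining causal self-attention with non-causal cross-attention to video features, and two decoding heads. All module choices, dimensions, vocabulary sizes, and the GRU stand-in for the multimodal language model are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a DualDub-style model, inferred from the abstract.
# Module choices, sizes, and token interfaces are illustrative assumptions only.
import torch
import torch.nn as nn


class CrossModalAligner(nn.Module):
    """Causal self-attention over audio-side states plus non-causal
    cross-attention to video features (assumed combination of the two
    mechanisms named in the abstract)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_h, video_h):
        T = audio_h.size(1)
        # Causal mask keeps audio generation autoregressive.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=audio_h.device), diagonal=1
        )
        x, _ = self.self_attn(audio_h, audio_h, audio_h, attn_mask=causal)
        audio_h = self.norm1(audio_h + x)
        # Non-causal cross-attention: every step may attend to the whole clip.
        x, _ = self.cross_attn(audio_h, video_h, video_h)
        return self.norm2(audio_h + x)


class DualDubSketch(nn.Module):
    """Backbone -> aligner -> two decoding heads (background audio + speech)."""

    def __init__(self, dim: int = 512, audio_vocab: int = 1024, speech_vocab: int = 1024):
        super().__init__()
        self.video_proj = nn.Linear(768, dim)               # hypothetical video-feature size
        self.backbone = nn.GRU(dim, dim, batch_first=True)  # stand-in for the multimodal LM
        self.aligner = CrossModalAligner(dim)
        self.bg_head = nn.Linear(dim, audio_vocab)           # background-audio token logits
        self.speech_head = nn.Linear(dim, speech_vocab)      # speech token logits

    def forward(self, audio_emb, video_feats):
        video_h = self.video_proj(video_feats)
        h, _ = self.backbone(audio_emb)
        h = self.aligner(h, video_h)
        return self.bg_head(h), self.speech_head(h)


if __name__ == "__main__":
    model = DualDubSketch()
    bg_logits, speech_logits = model(torch.randn(2, 50, 512), torch.randn(2, 30, 768))
    print(bg_logits.shape, speech_logits.shape)  # (2, 50, 1024) each
```

The point of the sketch is the split at the output: both heads share the aligned hidden states, which is one plausible way to obtain the "acoustic harmony" between speech and background audio that the abstract emphasizes.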
Problem

Research questions and friction points this paper is trying to address.

Existing V2A models generate background audio but largely overlook speech, leaving video soundtracks incomplete
Real paired data combining video, speech, and background audio is scarce
No benchmark exists for evaluating video-to-soundtrack (V2ST) generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal-language-model framework for joint background audio and speech generation
Cross-modal aligner combining causal and non-causal attention for synchronization and acoustic harmony
Curriculum learning strategy that progressively builds multimodal capability under data scarcity (see the sketch after this list)
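The curriculum bullet only names the strategy; below is a minimal sketch of how such a staged schedule might look, training simpler capabilities on abundant data before the scarce joint task. The stage names, data mixes, and epoch counts are assumptions for illustration, not the schedule reported in the paper.

```python
# Hypothetical staged curriculum for coping with scarce paired soundtrack data.
from typing import Callable

CURRICULUM = [
    # (stage name, data description, epochs) -- assumed, not from the paper
    ("audio_language_modeling", "large audio-only corpora",           10),
    ("video_to_background",     "video + background-audio pairs",      8),
    ("joint_v2st",              "scarce video + speech + background",  4),
]


def run_curriculum(train_stage: Callable[[str, str, int], None]) -> None:
    """Run stages in order so each one builds on the previous weights."""
    for stage, data, epochs in CURRICULUM:
        train_stage(stage, data, epochs)


if __name__ == "__main__":
    run_curriculum(lambda s, d, e: print(f"training {s} on {d} for {e} epochs"))
```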