🤖 AI Summary
This work addresses the challenge of generating immersive auditory experiences in speech synthesis that align with real-world physical scenes, a task hindered by the scarcity of multimodal data and insufficient disentanglement across modalities. To this end, we propose a unified vision-driven, high-fidelity speech synthesis framework. We first construct a large-scale multimodal dataset that establishes strong associations among visual scenes, speaker identity, and audio signals. We then introduce the D-MSVA alignment module, which combines a decoupled memory bank architecture with a cross-modal hybrid supervision strategy to achieve fine-grained acoustic feature alignment. Experimental results demonstrate that our approach significantly outperforms existing baselines in audio fidelity, content intelligibility, and multimodal consistency, with both subjective and objective evaluations confirming its superiority.
📝 Abstract
We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, which addresses the limitations of existing speech generation models in creating immersive auditory experiences aligned with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, whose programmatic construction pipeline establishes, for the first time, strong correlations among visual scenes, speaker identity, and audio. Second, we design a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results show that VividVoice significantly outperforms existing baseline models in audio fidelity, content intelligibility, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
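To make the decoupled memory-bank idea concrete, below is a minimal PyTorch sketch of what such an alignment module could look like. It is an illustration only, not the paper's actual implementation: the class names (`MemoryBank`, `DMSVA`, `hybrid_loss`), slot counts, embedding sizes, and the specific contrastive-plus-regression supervision are all assumptions. The sketch shows two independent learnable memory banks, one for speaker timbre and one for environmental acoustics, each read by visual tokens through cross-attention, with a hybrid loss combining an InfoNCE-style contrastive term and a regression term.

```python
# Hypothetical sketch of a decoupled memory-bank alignment module in the
# spirit of D-MSVA. All names and hyperparameters are illustrative
# assumptions, not the paper's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryBank(nn.Module):
    """Learnable memory slots read by visual features via cross-attention."""

    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (B, T, D) visual tokens; memory broadcast to (B, S, D)
        mem = self.slots.unsqueeze(0).expand(query.size(0), -1, -1)
        out, _ = self.attn(query, mem, mem)  # read from the memory bank
        return out.mean(dim=1)               # pooled (B, D) embedding


class DMSVA(nn.Module):
    """Decoupled banks: one for speaker timbre, one for scene acoustics."""

    def __init__(self, dim: int = 256, slots: int = 64):
        super().__init__()
        self.timbre_bank = MemoryBank(slots, dim)
        self.env_bank = MemoryBank(slots, dim)

    def forward(self, visual_tokens: torch.Tensor):
        timbre_emb = self.timbre_bank(visual_tokens)  # conditions voice identity
        env_emb = self.env_bank(visual_tokens)        # conditions room acoustics
        return timbre_emb, env_emb


def hybrid_loss(timbre_emb, env_emb, timbre_target, env_target, tau=0.07):
    """Hybrid supervision: InfoNCE on timbre + MSE on acoustic features."""
    logits = F.normalize(timbre_emb, dim=-1) @ F.normalize(timbre_target, dim=-1).T
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits / tau, labels)
    regression = F.mse_loss(env_emb, env_target)
    return contrastive + regression


if __name__ == "__main__":
    model = DMSVA()
    vis = torch.randn(8, 49, 256)  # e.g. 7x7 patch tokens from a scene image
    t_emb, e_emb = model(vis)
    loss = hybrid_loss(t_emb, e_emb, torch.randn(8, 256), torch.randn(8, 256))
    loss.backward()
    print(t_emb.shape, e_emb.shape, float(loss))
```

In this reading, keeping the two banks fully separate is what enforces modality decoupling: gradients from the timbre objective never update the environment slots and vice versa, so each bank specializes in one acoustic factor of the visual scene.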