VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

📅 2026-02-01
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the challenge of generating immersive auditory experiences that align with real-world physical scenes in speech synthesis, a task hindered by the scarcity of multimodal data and insufficient disentanglement across modalities. To this end, the authors propose a unified vision-driven, high-fidelity speech synthesis framework. They first construct a large-scale multimodal dataset that establishes strong associations among visual scenes, speaker identity, and audio signals. They then introduce the D-MSVA alignment module, which builds on a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment of acoustic features. Experimental results show that the approach significantly outperforms existing baselines in audio fidelity, content intelligibility, and multimodal consistency, with both subjective and objective evaluations confirming its superiority.

📝 Abstract
We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes for the first time a strong correlation between visual scenes, speaker identity, and audio. Second, we design a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
Problem

Research questions and friction points this paper is trying to address.

Scene-Aware Speech Synthesis
Visually-Driven Speech
Multimodal Alignment
Immersive Audio
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene-Aware Speech Synthesis
Multimodal Alignment
Visually-Driven Generation
Decoupled Memory Bank
Hybrid Supervision
Authors

Chengyuan Ma — Shenzhen International Graduate School, Tsinghua University, China
Jiawei Jin — Shenzhen International Graduate School, Tsinghua University, China
Ruijie Xiong — Ant Group, China
Chunxiang Jin — Ant Group, China
Canxiang Yan — Ant Group, China
Wenming Yang — Tsinghua University (Computer Vision · Image Processing)