A Multi-Stage Framework for Multimodal Controllable Speech Synthesis

📅 2025-06-25
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing controllable speech synthesis methods suffer from three key limitations: face-based approaches exhibit poor generalization; text-prompted methods lack both diversity and fine-grained control; and multimodal methods heavily rely on perfectly aligned cross-modal training data. To address these issues, we propose a three-stage multimodal controllable speech synthesis framework. First, knowledge distillation enhances the robustness of the facial encoder. Second, cross-modal joint training disentangles textual and facial representations. Third, adaptive multimodal feature fusion enables flexible style modulation. This framework substantially reduces dependence on strictly paired data while supporting fine-grained and highly diverse voice style control. Experiments demonstrate significant improvements over unimodal baselines on both face-driven and text-guided tasks, achieving state-of-the-art performance in speech naturalness, style fidelity, and control accuracy.
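
To make the first stage concrete, below is a minimal PyTorch sketch of distilling a frozen pretrained speaker encoder into a facial encoder. The toy `FaceEncoder` architecture, the cosine-distance loss, and the 256-dimensional embedding are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEncoder(nn.Module):
    """Toy stand-in for the paper's facial encoder (architecture is an assumption)."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # L2-normalized embedding so it lives on the same sphere as the teacher's.
        return F.normalize(self.backbone(face), dim=-1)

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Pull the face embedding toward the (frozen) speaker embedding of the
    # paired utterance; cosine distance is one common choice of objective.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

# Usage sketch: in practice `teacher_emb` would come from a frozen pretrained
# speaker encoder applied to the speech clip paired with each face crop.
student = FaceEncoder()
faces = torch.randn(8, 3, 112, 112)                      # batch of face crops
teacher_emb = F.normalize(torch.randn(8, 256), dim=-1)   # placeholder teacher output
loss = distillation_loss(student(faces), teacher_emb)
loss.backward()
```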

📝 Abstract
Controllable speech synthesis aims to control the style of generated speech using a reference input, which can come from various modalities. Existing face-based methods struggle with robustness and generalization due to data quality constraints, while text-prompt methods offer limited diversity and fine-grained control. Although multimodal approaches aim to integrate various modalities, their reliance on fully matched training data significantly constrains their performance and applicability. This paper proposes a three-stage multimodal controllable speech synthesis framework to address these challenges. For the face encoder, we use supervised learning and knowledge distillation to tackle generalization issues. Furthermore, the text encoder is trained on both text-face and text-speech data to enhance the diversity of the generated speech. Experimental results demonstrate that this method outperforms single-modal baselines in both face-based and text-prompt-based speech synthesis, highlighting its effectiveness in generating high-quality speech.
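
One plausible reading of the cross-modal joint training stage is a CLIP-style contrastive objective that aligns text-prompt embeddings with style embeddings from either face or speech data in one shared space, so fully matched text-face-speech triplets are never required. A minimal sketch under that assumption (the InfoNCE loss form and the `temperature` value are not from the paper):

```python
import torch
import torch.nn.functional as F

def text_style_contrastive_loss(text_emb: torch.Tensor,
                                style_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning text-prompt embeddings with style
    embeddings from either modality (face or speech) in a shared space."""
    text_emb = F.normalize(text_emb, dim=-1)
    style_emb = F.normalize(style_emb, dim=-1)
    logits = text_emb @ style_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(len(text_emb), device=text_emb.device)
    # Matched pairs sit on the diagonal; average both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Each training step can draw a batch of text-face pairs OR text-speech
# pairs, so the two pair types never need to coexist for the same sample.
text_emb = torch.randn(16, 256, requires_grad=True)   # from the text encoder
face_emb = torch.randn(16, 256)                       # e.g. from the distilled face encoder
loss = text_style_contrastive_loss(text_emb, face_emb)
loss.backward()
```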
Problem

Research questions and friction points this paper is trying to address.

Enhance robustness and generalization in face-based speech synthesis
Improve diversity and fine-grained control in text-prompt methods
Remove multimodal approaches' reliance on fully matched training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage multimodal framework for speech synthesis, culminating in adaptive multimodal feature fusion (see the sketch after this list)
Supervised learning and knowledge distillation for face encoder
Text encoder trained on text-face and text-speech data
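
The adaptive multimodal feature fusion named in the summary could, for example, be realized as a learned gate over the two style embeddings. A minimal sketch, assuming gated per-dimension fusion; the `AdaptiveFusion` module and its dimensions are hypothetical, not the paper's actual design:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Gated fusion of text and face style embeddings. A learned gate decides,
    per dimension, how much each modality contributes, so either input can be
    zeroed out at inference for single-modality style control."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor, face_emb: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([text_emb, face_emb], dim=-1))  # gate in (0, 1)
        return g * text_emb + (1.0 - g) * face_emb              # convex combination

# Usage: produces a single fused style vector to condition the synthesizer.
fusion = AdaptiveFusion()
style = fusion(torch.randn(4, 256), torch.randn(4, 256))  # -> shape (4, 256)
```

A convex gated combination is one simple way to let the model modulate style flexibly between modalities; attention-based fusion would be another plausible choice.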
Rui Niu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Weihao Wu
Tsinghua University
Jie Chen
Youtu Lab, Tencent, Beijing, China
Long Ma
Dalian University of Technology
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China