FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

πŸ“… 2025-01-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing controllable text-to-speech (TTS) methods rely on fixed textual prompts, struggling to simultaneously achieve high voice fidelity, fine-grained speaking style control, and visual-semantic alignment. To address this, we propose the first multimodal controllable TTS framework jointly driven by text, audio, and image prompts. Our method introduces a unified multimodal prompt encoder for cross-modal semantic alignment; designs a composable, multi-stage control architecture that explicitly decouples speaker identity, prosody, and speaking style modeling; and establishes the first end-to-end multimodal data collection and construction pipeline tailored for controllable TTS. Experiments demonstrate significant improvements over strong baselines: +0.42 in subjective MOS, along with consistent gains in speaker similarity (SIM) and character error rate (CER). The framework supports arbitrary combinations of input modalities to generate high-fidelity, highly controllable speech. Code and audio samples are publicly released.

πŸ“ Abstract
Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies different text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a data collection pipeline for multimodal datasets to facilitate further research and applications in this field. Comprehensive subjective and objective experiments demonstrate the effectiveness of FleSpeech. Audio samples are available at https://kkksuper.github.io/FleSpeech/
Problem

Research questions and friction points this paper is trying to address.

Controllable Speech Synthesis
Lack of Creativity
Flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

FleSpeech
Multi-stage Speech Generation
Flexible Input Handling
Hanzhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Speech Synthesis
Spontaneous Speech
Speech Codec
Yuke Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
Speech Synthesis
Singing Voice Synthesis
Voice Conversion
Jingbin Hu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Qicong Xie
Tencent AI Lab, China
Shan Yang
Tencent AI Lab, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China