AI Summary
Existing controllable text-to-speech (TTS) methods rely on fixed textual prompts and struggle to simultaneously achieve high voice fidelity, fine-grained speaking-style control, and visual-semantic alignment. To address this, we propose the first multimodal controllable TTS framework jointly driven by text, audio, and image prompts. Our method introduces a unified multimodal prompt encoder for cross-modal semantic alignment; designs a composable, multi-stage control architecture that explicitly decouples the modeling of speaker identity, prosody, and speaking style; and establishes the first end-to-end multimodal data collection and construction pipeline tailored for controllable TTS. Experiments demonstrate significant improvements over strong baselines: +0.42 in subjective MOS, along with consistent gains in speaker similarity (SIM) and reductions in character error rate (CER). The framework supports arbitrary combinations of input modalities to generate high-fidelity, highly controllable speech. Code and audio samples are publicly released.
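The unified multimodal prompt encoder described above can be caricatured as follows: each modality's features are projected into a shared embedding space and fused, so any subset of prompts yields one cohesive control vector. This is a minimal NumPy sketch under assumed feature dimensions; the class and variable names are hypothetical and not from the paper, and random projections stand in for trained per-modality encoders.

```python
import numpy as np

class MultimodalPromptEncoder:
    """Toy sketch: project each modality's features into a shared
    embedding space, then fuse whatever prompts are present."""

    def __init__(self, dims, d_model=8, seed=0):
        rng = np.random.default_rng(seed)
        # One random linear projection per modality (stand-in for trained encoders).
        self.proj = {m: rng.standard_normal((d, d_model)) / np.sqrt(d)
                     for m, d in dims.items()}

    def __call__(self, prompts):
        # prompts: dict of modality name -> feature vector; any subset is allowed,
        # which is what enables arbitrary combinations of input modalities.
        embs = [feat @ self.proj[m] for m, feat in prompts.items()]
        return np.mean(embs, axis=0)  # simple average fusion

enc = MultimodalPromptEncoder({"text": 4, "audio": 6, "image": 5})
z = enc({"text": np.ones(4), "image": np.ones(5)})  # text + image prompt only
```

In a real system the mean fusion would typically be replaced by attention over the modality embeddings, but the key property illustrated here is that every prompt combination maps into the same representation space.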
Abstract
Controllable speech generation methods typically rely on a single or fixed prompt, which limits creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose *FleSpeech*, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a multimodal dataset collection pipeline to facilitate further research and applications in this field. Comprehensive subjective and objective experiments demonstrate the effectiveness of FleSpeech. Audio samples are available at https://kkksuper.github.io/FleSpeech/