🤖 AI Summary
This work addresses the limitation of existing InstructTTS approaches in interpreting flexible, high-level natural language instructions, which hinders fine-grained user control over speech style. To overcome this, we propose a novel reasoning-driven text-to-speech synthesis paradigm tailored for open-vocabulary instructions. We first construct OV-Speech, a new dataset containing instruction-following examples with explicit reasoning chains, and then design an integrated framework that jointly performs natural language understanding and speech synthesis. This framework infers emotional, acoustic, and paralinguistic attributes from open-ended instructions to guide expressive speech generation. Experimental results demonstrate that our method significantly outperforms current models in both instruction-following accuracy and vocal expressiveness, exhibiting superior generalization capability and practical applicability.
📝 Abstract
Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on a direct combination of audio-related labels or their diverse rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users, such as content creators, who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability. The dataset and demos are publicly available on our project page.
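The two-stage paradigm described in the abstract (first infer explicit style attributes from an open-vocabulary instruction, then condition synthesis on them) can be sketched roughly as below. This is a minimal illustration only: the attribute names, keyword rules, and function signatures are assumptions for exposition, not the paper's actual model, which performs this reasoning with learned components.

```python
from dataclasses import dataclass, field

@dataclass
class StyleAttributes:
    # Hypothetical attribute set; the paper infers emotional, acoustic,
    # and paralinguistic information, but its exact schema may differ.
    emotion: str = "neutral"
    pace: str = "normal"
    pitch: str = "medium"
    paralinguistics: list = field(default_factory=list)

def infer_style(instruction: str) -> StyleAttributes:
    """Stand-in for the reasoning step: map a high-level, open-vocabulary
    instruction to explicit attributes before any audio is generated.
    Keyword matching here is purely illustrative."""
    text = instruction.lower()
    attrs = StyleAttributes()
    if "thrilled" in text or "excited" in text:
        attrs.emotion = "happy"
        attrs.pace = "fast"
        attrs.pitch = "high"
    if "whisper" in text:
        attrs.paralinguistics.append("whisper")
    return attrs

def synthesize(text: str, attrs: StyleAttributes) -> str:
    # Placeholder for an acoustic model conditioned on the inferred
    # attributes; returns a tagged string instead of a waveform.
    return (f"<speech emotion={attrs.emotion} pace={attrs.pace} "
            f"pitch={attrs.pitch}>{text}</speech>")

attrs = infer_style("Sound thrilled, like you just won a prize.")
audio = synthesize("We did it!", attrs)
```

The point of the sketch is the ordering: the instruction is resolved into concrete attributes first, so the synthesizer never has to interpret free-form language directly.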