OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing InstructTTS approaches in interpreting flexible, high-level natural language instructions, which hinders fine-grained user control over speech style. To overcome this, we propose a novel reasoning-driven text-to-speech synthesis paradigm tailored for open-vocabulary instructions. We first construct OV-Speech, a new dataset containing instruction-following examples with explicit reasoning chains, and then design an integrated framework that jointly performs natural language understanding and speech synthesis. This framework infers emotional, acoustic, and paralinguistic attributes from open-ended instructions to guide expressive speech generation. Experimental results demonstrate that our method significantly outperforms current models in both instruction-following accuracy and vocal expressiveness, exhibiting superior generalization capability and practical applicability.

📝 Abstract
Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on a direct combination of audio-related labels or their diverse rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability. The dataset and demos are publicly available on our project page.
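The two-stage design described in the abstract (first reason from an open-vocabulary instruction to explicit style attributes, then condition speech synthesis on them) could be sketched as below. This is a minimal illustrative toy, not the paper's implementation: every function name, attribute field, and the keyword rules standing in for the reasoning model are assumptions made for the example.

```python
# Hypothetical sketch of a reasoning-driven InstructTTS pipeline.
# The real system uses a learned reasoning model; simple keyword rules
# stand in for it here purely for illustration.

from dataclasses import dataclass, field


@dataclass
class StyleAttributes:
    """Explicit style targets inferred from a free-form instruction."""
    emotion: str                      # e.g. "excited", "neutral"
    acoustic: dict = field(default_factory=dict)   # e.g. {"pitch": "high"}
    paralinguistic: list = field(default_factory=list)  # e.g. ["laughter"]


def reason_over_instruction(instruction: str) -> StyleAttributes:
    """Stage 1 (toy): map an open-vocabulary instruction to style attributes."""
    text = instruction.lower()
    if "thrill" in text or "excit" in text:
        emotion = "excited"
        acoustic = {"pitch": "high", "rate": "fast"}
    else:
        emotion = "neutral"
        acoustic = {"pitch": "mid", "rate": "normal"}
    paralinguistic = ["laughter"] if "laugh" in text else []
    return StyleAttributes(emotion, acoustic, paralinguistic)


def synthesize(text: str, attrs: StyleAttributes) -> str:
    """Stage 2 (placeholder): a real backend would generate audio conditioned
    on the attributes; here we just describe the conditioning."""
    return f"<speech text={text!r} emotion={attrs.emotion} acoustic={attrs.acoustic}>"


attrs = reason_over_instruction("Sound like you just won the lottery, thrilled!")
print(synthesize("I can't believe it!", attrs))
```

The point of the separation is that the open-ended instruction never reaches the synthesizer directly: only the inferred, explicit attributes do, which is what makes the control interpretable.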
Problem

Research questions and friction points this paper is trying to address.

InstructTTS, open-vocabulary, speech synthesis, instruction following, expressive speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary, InstructTTS, Reasoning-Driven, Speech Synthesis, Instruction-Following
Yong Ren
Institute of Automation, Chinese Academy of Sciences
Speech Codec, Text-to-speech, Video-to-audio, MLLM, Continual Learning
Jiangyan Yi
Tsinghua University
speech signal processing, speech synthesis, fake audio detection, continual learning
Jianhua Tao
Department of Automation, Tsinghua University, China
Haiyang Sun
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, China
Zhengqi Wen
Tsinghua University
LLM
Hao Gu
Sun Yat-Sen University
Planetary aeronomy, Atmospheric escape, Space physics
Le Xu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Voice Synthesis, Audio-visual Learning
Ye Bai
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, China