BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-speech (TTS) methods fail to effectively leverage large language models’ (LLMs) instruction-following capabilities, limiting controllability and cross-lingual generalization in TTS. This paper proposes BatonVoice—a decoupled, instruction-driven TTS framework inspired by operationalism: an LLM acts as the “conductor,” parsing user instructions into a textual control plan encoding acoustic attributes (e.g., pitch, energy); a dedicated TTS model, BatonTTS, serves as the “orchestra,” faithfully synthesizing speech from this plan. To our knowledge, this is the first work to introduce the operationalist paradigm to TTS, explicitly separating instruction interpretation from acoustic generation. BatonVoice enables zero-shot cross-lingual control and significantly outperforms state-of-the-art open-source and proprietary baselines on controllable and expressive TTS tasks, demonstrating exceptional generalization to unseen languages.

📝 Abstract
The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions and generating a textual "plan" of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing controllable speech synthesis with LLMs' linguistic intelligence
Decoupling instruction understanding from speech generation via operationalism
Achieving zero-shot cross-lingual generalization in speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples instruction understanding from speech generation
Uses LLM as conductor to generate vocal feature plans
Employs separate TTS model as orchestra for synthesis
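The decoupling described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `conductor` stub stands in for the LLM (a real system would prompt one), the `orchestra` stub stands in for BatonTTS, and the `VocalPlan` fields and keyword rules are hypothetical. The point it shows is the interface boundary: the synthesis stage sees only the textual feature plan, never the original instruction.

```python
from dataclasses import dataclass

@dataclass
class VocalPlan:
    """Textual 'plan' of explicit vocal features (fields are illustrative)."""
    text: str
    pitch: str
    energy: str
    speed: str

def conductor(instruction: str, text: str) -> VocalPlan:
    # Stand-in for the LLM "conductor": interprets the user instruction
    # and emits explicit vocal features. A keyword lookup replaces the
    # actual LLM call for this sketch.
    if "excited" in instruction.lower():
        return VocalPlan(text=text, pitch="high", energy="high", speed="fast")
    if "calm" in instruction.lower():
        return VocalPlan(text=text, pitch="low", energy="low", speed="slow")
    return VocalPlan(text=text, pitch="medium", energy="medium", speed="medium")

def orchestra(plan: VocalPlan) -> str:
    # Stand-in for the TTS "orchestra" (BatonTTS in the paper): it
    # conditions only on the plan, which is why swapping the text's
    # language leaves the control pathway unchanged.
    return f"<speech text={plan.text!r} pitch={plan.pitch} energy={plan.energy} speed={plan.speed}>"

plan = conductor("Say this in an excited voice", "Hello there!")
print(orchestra(plan))
```

Because the plan is plain text, it is language-agnostic by construction, which is one intuition for the zero-shot cross-lingual control the paper reports.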
👥 Authors
Yue Wang — Tencent Multimodal Department
Ruotian Ma — Tencent Multimodal Department
Xingyu Chen — Tencent Multimodal Department
Zhengliang Shi — Shandong University (Natural Language Processing, LLM Agent, Knowledge Discovery)
Wanshun Chen — Tencent Multimodal Department
Huang Liu — Tencent Multimodal Department
Jiadi Yao — Tencent Multimodal Department
Qu Yang — National University of Singapore (Deep Learning, Spiking Neural Network, Neuromorphic Computing)
Qingxuan Jiang — Graduate Student, MIT (Machine Learning, Optimization)
Fanghua Ye — University College London (Conversational AI, AI Assistants, Graph, NLP, LLM)
Juntao Li — Soochow University (Language Models, Text Generation)
Min Zhang — Soochow University
Zhaopeng Tu — Tech Lead @ Tencent Digital Human (Digital Human, Agents, Large Language Models, Machine Translation)
Xiaolong Li — Tencent Multimodal Department
Linus — Tencent Multimodal Department