Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

📅 2025-09-29
🤖 AI Summary
Existing emotional TTS research is largely confined to utterance-level control, which prevents fine-grained, word-level modulation of emotion and speaking rate. The main obstacles are the scarcity of data annotated with intra-sentence emotion transitions and the difficulty of modeling transitions between multiple emotions. To address this, the paper proposes WeSCon, described as the first self-training framework for word-level control that requires no data containing intra-sentence emotion or speed transitions. WeSCon first guides a pretrained zero-shot TTS model through a multi-round inference process, using a transition-smoothing strategy and a dynamic speed control mechanism. It then fine-tunes the model via self-training with a dynamic emotional attention bias, so that word-level control works end to end in a single pass. The reported results show state-of-the-art word-level emotion control despite the annotation scarcity, while preserving the original model's zero-shot synthesis quality and cross-speaker generalization.

📝 Abstract
While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

Problem

Research questions and friction points this paper is trying to address.

Achieving word-level emotional expression control in zero-shot TTS
Overcoming data scarcity for intra-sentence emotional variation modeling
Enabling fine-grained control of emotion and speaking rate transitions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training framework for word-level TTS control
Transition-smoothing strategy with dynamic speed mechanism
Dynamic emotional attention bias for end-to-end synthesis
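
The abstract does not say how the dynamic emotional attention bias is computed. To illustrate only the general idea of steering attention with word-level emotion labels, here is a minimal NumPy sketch. It assumes each text token carries an emotion id and each key position belongs to an emotion-specific prompt segment. The function names, the fixed `penalty` value, and the static (non-dynamic) bias are all hypothetical and are not taken from WeSCon.

```python
import numpy as np

def emotion_attention_bias(token_emotions, key_emotions, penalty=4.0):
    """Additive bias: 0 where a token's emotion matches the key's emotion,
    -penalty elsewhere. Shape: (num_tokens, num_keys). Hypothetical helper."""
    tok = np.asarray(token_emotions)[:, None]   # (num_tokens, 1)
    key = np.asarray(key_emotions)[None, :]     # (1, num_keys)
    return np.where(tok == key, 0.0, -penalty)

def biased_attention(scores, bias):
    """Row-wise softmax over raw attention scores plus the additive bias."""
    logits = scores + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Assumed toy setup: emotion ids 0 = happy, 1 = sad.
HAPPY, SAD = 0, 1
token_emotions = [HAPPY, HAPPY, SAD]       # one label per word/token
key_emotions = [HAPPY, HAPPY, SAD, SAD]    # prompt frames grouped by emotion
scores = np.zeros((3, 4))                  # uniform raw scores for clarity

bias = emotion_attention_bias(token_emotions, key_emotions)
weights = biased_attention(scores, bias)
```

In this toy setup each token places most of its attention mass on the prompt frames that share its emotion label. A soft penalty rather than a hard mask is used here so that mismatched frames still contribute a little, which is one plausible way to avoid abrupt changes at emotion boundaries.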

Authors

Tianrui Wang
Tianjin University
Speech Signal Processing
Haoyu Wang
Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University
Meng Ge
Tianjin University; CUHK-Shenzhen; National University of Singapore
Cheng Gong
TeleAI, China Telecom
Chunyu Qiang
Kuaishou Technology; TJU; CASIA
Speech Synthesis
Ziyang Ma
Nanyang Technological University; Shanghai Jiao Tong University
Zikang Huang
Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University
Guanrou Yang
Shanghai Jiao Tong University
Xiaobao Wang
Associate Professor, Tianjin University
Artificial intelligence, safety of large-model generation, graph machine learning
Eng Siong Chng
Nanyang Technological University
Xie Chen
Shanghai Jiao Tong University
Longbiao Wang
Professor, Tianjin University
Speech processing, speech recognition, speaker recognition, acoustic signal processing, speech enhancement
Jianwu Dang
JAIST, Japan / Tianjin Univ., China
Speech science, speech production, EEG, disorder speech