Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing emotional TTS research is largely confined to sentence-level control, which prevents fine-grained, word-level modulation of emotion and speaking rate. The main obstacles are the scarcity of annotations for intra-sentence emotion transitions and the difficulty of modeling how multiple emotions evolve within a sentence. To address this, we propose WeSCon, the first self-training framework that requires no fine-grained intra-sentence annotations. WeSCon unlocks word-level expressiveness in a pretrained zero-shot TTS model through a transition-smoothing strategy, dynamic speaking-rate control, and a dynamic emotional attention bias. A multi-round inference process first guides the pretrained model toward word-level expressive synthesis, and self-training fine-tuning then makes that control available end to end. WeSCon achieves state-of-the-art performance on word-level emotion control, alleviating annotation scarcity while preserving the original model's strong zero-shot synthesis quality and cross-speaker generalization capability.

📝 Abstract

While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

Problem

Research questions and friction points this paper is trying to address.

Achieving word-level emotional expression control in zero-shot TTS

Overcoming data scarcity for intra-sentence emotional variation modeling

Enabling fine-grained control of emotion and speaking rate transitions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training framework for word-level TTS control

Transition-smoothing strategy with dynamic speed mechanism

Dynamic emotional attention bias for end-to-end synthesis
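
The idea of an emotional attention bias can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not specify how the bias is computed, so the rule below (add a fixed bonus when a target word and a prompt token share an emotion label) is an assumption, and the names emotion_attention_bias, biased_attention, and strength are hypothetical. The sketch only shows the general technique of adding a bias to attention scores before the softmax so that each word attends more strongly to prompt tokens of its own emotion.

```python
import math


def emotion_attention_bias(target_emotions, prompt_emotions, strength=2.0):
    """Build an additive bias matrix of shape (len(target), len(prompt)).

    Entry [i][j] is `strength` when target word i and prompt token j carry
    the same emotion label, else 0.0. Hypothetical rule, for illustration.
    """
    return [
        [strength if t == p else 0.0 for p in prompt_emotions]
        for t in target_emotions
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, bias):
    """Add the bias to the raw attention scores, then normalize each row."""
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# Toy input: two target words tagged "happy" then "sad", attending over a
# four-token prompt made of two "happy" tokens followed by two "sad" tokens.
target_emotions = ["happy", "sad"]
prompt_emotions = ["happy", "happy", "sad", "sad"]
scores = [[0.0] * len(prompt_emotions) for _ in target_emotions]  # uniform

bias = emotion_attention_bias(target_emotions, prompt_emotions)
weights = biased_attention(scores, bias)
```

With uniform raw scores, the first row of weights puts more mass on the two "happy" prompt tokens and the second row on the two "sad" ones. In a real model the bias would presumably vary over time and be learned or scheduled rather than fixed, which is what "dynamic" suggests here.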

🔎 Similar Papers

Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

2024-09-25, arXiv.org, Citations: 4