EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of jointly modeling emotional expressiveness and emphasis in expressive speech synthesis—specifically, how to enhance emotional intensity while preserving the clarity and stability of target emphasis across diverse emotional contexts. To this end, we propose EME-TTS, a novel framework introducing a weakly supervised emphasis modeling paradigm that integrates LLM-predicted emphasis pseudo-labels with variance-based emphasis feature extraction. We further design an Emphasis Perception Enhancement (EPE) block that explicitly couples emotional representations with emphasis positions. Experimental results demonstrate significant improvements in emphasis discriminability, emotional naturalness, and overall perceptual quality across multiple emotion conditions. Notably, EME-TTS achieves stable and controllable modeling of the emotion–emphasis interaction, pointing toward a new paradigm for highly expressive text-to-speech synthesis.

📝 Abstract
In recent years, emotional Text-to-Speech (TTS) synthesis and emphasis-controllable speech synthesis have advanced significantly. However, their interaction remains underexplored. We propose Emphasis Meets Emotion TTS (EME-TTS), a novel framework designed to address two key research questions: (1) how to effectively utilize emphasis to enhance the expressiveness of emotional speech, and (2) how to maintain the perceptual clarity and stability of target emphasis across different emotions. EME-TTS employs weakly supervised learning with emphasis pseudo-labels and variance-based emphasis features. Additionally, the proposed Emphasis Perception Enhancement (EPE) block enhances the interaction between emotional signals and emphasis positions. Experimental results show that EME-TTS, when combined with large language models for emphasis position prediction, enables more natural emotional speech synthesis while preserving stable and distinguishable target emphasis across emotions. Synthesized samples are available online.
Problem

Research questions and friction points this paper is trying to address.

Enhancing emotional speech expressiveness using emphasis
Maintaining clarity and stability of emphasis across emotions
Improving interaction between emotion signals and emphasis positions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly supervised learning with emphasis pseudo-labels
Variance-based emphasis features for clarity
Emphasis Perception Enhancement block for interaction
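The paper itself does not publish implementation details here, but the "variance-based emphasis features" idea can be illustrated with a minimal sketch: compute per-word variance of frame-level pitch and energy contours, then normalize across the utterance so that words with unusually dynamic prosody stand out as candidate emphasis positions. All names below (`variance_emphasis_features`, the frame-alignment input format) are hypothetical and chosen for illustration only.

```python
import numpy as np

def variance_emphasis_features(f0, energy, word_frames):
    """Per-word variance-based emphasis features (illustrative sketch).

    f0, energy : 1-D frame-level pitch and energy contours.
    word_frames: list of (start, end) frame-index pairs, one per word.

    Returns an (n_words, 2) array of [pitch variance, energy variance],
    z-normalized across the utterance so that words with unusually
    high local prosodic variance emerge as emphasis candidates.
    """
    feats = np.array([
        [np.var(f0[s:e]), np.var(energy[s:e])]
        for s, e in word_frames
    ])
    # Normalize each feature dimension across the words of the utterance.
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-8  # avoid division by zero
    return (feats - mu) / sigma
```

In this toy formulation, a word spoken with a flat contour scores near or below zero, while a word with large pitch and energy excursions scores high on both dimensions, which is the kind of signal a weakly supervised emphasis model could be trained against.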
Haoxun Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China
Leyuan Qu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China
Speech Representation Learning; Multi-modal Learning and Affective Computing
Jiaxi Hu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China
Taihao Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China