EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of jointly modeling emotional expressiveness and emphasis in expressive speech synthesis—specifically, how to enhance emotional intensity while preserving the clarity and stability of target emphasis across diverse emotional contexts. To this end, we propose EME-TTS, a novel framework introducing a weakly supervised emphasis modeling paradigm that integrates LLM-predicted emphasis pseudo-labels with variance-based emphasis feature extraction. We further design an Emphasis Perception Enhancement (EPE) block that explicitly couples emotional representations with emphasis positions. Experimental results demonstrate significant improvements in emphasis discriminability, emotional naturalness, and overall perceptual quality across multiple emotion conditions. Notably, EME-TTS achieves stable and controllable modeling of the emotion–emphasis interaction, pointing toward a new paradigm for highly expressive text-to-speech synthesis.

📝 Abstract
In recent years, emotional Text-to-Speech (TTS) synthesis and emphasis-controllable speech synthesis have advanced significantly. However, their interaction remains underexplored. We propose Emphasis Meets Emotion TTS (EME-TTS), a novel framework designed to address two key research questions: (1) how to effectively utilize emphasis to enhance the expressiveness of emotional speech, and (2) how to maintain the perceptual clarity and stability of target emphasis across different emotions. EME-TTS employs weakly supervised learning with emphasis pseudo-labels and variance-based emphasis features. Additionally, the proposed Emphasis Perception Enhancement (EPE) block enhances the interaction between emotional signals and emphasis positions. Experimental results show that EME-TTS, when combined with large language models for emphasis position prediction, enables more natural emotional speech synthesis while preserving stable and distinguishable target emphasis across emotions. Synthesized samples are available online.
Problem

Research questions and friction points this paper is trying to address.

Enhancing emotional speech expressiveness using emphasis
Maintaining clarity and stability of emphasis across emotions
Improving interaction between emotion signals and emphasis positions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly supervised learning with emphasis pseudo-labels
Variance-based emphasis features for clarity
Emphasis Perception Enhancement block for interaction
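The paper itself does not publish implementation details here, but the "variance-based emphasis features" idea can be illustrated with a minimal sketch: compute per-word variance of frame-level pitch and energy contours, then normalize across the utterance so that words with unusually dynamic prosody stand out as candidate emphasis positions. All names below (`variance_emphasis_features`, the frame-alignment input format) are hypothetical and chosen for illustration only.

```python
import numpy as np

def variance_emphasis_features(f0, energy, word_frames):
    """Per-word variance-based emphasis features (illustrative sketch).

    f0, energy : 1-D frame-level pitch and energy contours.
    word_frames: list of (start, end) frame-index pairs, one per word.

    Returns an (n_words, 2) array of [pitch variance, energy variance],
    z-normalized across the utterance so that words with unusually
    high local prosodic variance emerge as emphasis candidates.
    """
    feats = np.array([
        [np.var(f0[s:e]), np.var(energy[s:e])]
        for s, e in word_frames
    ])
    # Normalize each feature dimension across the words of the utterance.
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-8  # avoid division by zero
    return (feats - mu) / sigma
```

In this toy formulation, a word spoken with a flat contour scores near or below zero, while a word with large pitch and energy excursions scores high on both dimensions, which is the kind of signal a weakly supervised emphasis model could be trained against.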
Haoxun Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China
Leyuan Qu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China
Speech Representation Learning; Multi-modal Learning and Affective Computing
Jiaxi Hu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China
Taihao Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, China