Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

📅 2025-09-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing emotional TTS systems predominantly rely on sentence-level controls such as discrete labels, reference audio, or textual prompts, which makes them inadequate for modeling dynamic emotional variation within a sentence. To address this, we propose Emo-FiLM, the first framework to enable word-level fine-grained emotional speech synthesis. It employs emotion2vec to extract frame-level affective features, aligns them to the word level via forced alignment, and introduces FiLM layers that dynamically modulate word embeddings to produce emotion-aware text representations. The model is built on an LLM-based TTS architecture and trained end to end. To enable rigorous fine-grained evaluation, we introduce FEDD, the first emotional speech dataset annotated with word-level emotion transitions. Experiments demonstrate that Emo-FiLM significantly outperforms state-of-the-art methods on both global emotional expressiveness and word-level emotion transition fidelity, substantially improving the naturalness, accuracy, and expressiveness of synthesized speech.
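The core conditioning mechanism is Feature-wise Linear Modulation (FiLM): per-word emotion features predict a scale and shift that are applied to the corresponding word embeddings. Below is a minimal PyTorch sketch of that idea; the module name, dimensions, and the single linear projection are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class WordLevelFiLM(nn.Module):
    """Hypothetical sketch: modulate word embeddings with word-level emotion features.

    Assumes emotion features have already been pooled from frame level to word
    level (e.g. via forced-alignment boundaries); dimensions are illustrative.
    """
    def __init__(self, emo_dim: int = 768, text_dim: int = 1024):
        super().__init__()
        # One projection predicts both scale (gamma) and shift (beta) per word.
        self.to_gamma_beta = nn.Linear(emo_dim, 2 * text_dim)

    def forward(self, word_emb: torch.Tensor, word_emo: torch.Tensor) -> torch.Tensor:
        # word_emb: (batch, num_words, text_dim)  text embeddings
        # word_emo: (batch, num_words, emo_dim)   word-level emotion features
        gamma, beta = self.to_gamma_beta(word_emo).chunk(2, dim=-1)
        # FiLM: feature-wise affine transform of each word embedding.
        return gamma * word_emb + beta


# Toy usage with random tensors.
film = WordLevelFiLM()
words = torch.randn(2, 12, 1024)     # 12 words per utterance
emotions = torch.randn(2, 12, 768)   # one emotion vector per word
modulated = film(words, emotions)    # same shape as `words`
```

Because the modulation is applied per word rather than per sentence, the downstream LLM-TTS sees text embeddings whose emotional coloring can change mid-utterance, which is what enables word-level emotion transitions.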

📝 Abstract
Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
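The word-level emotion features themselves come from pooling frame-level emotion2vec outputs over the word boundaries produced by a forced aligner. A small sketch of that aggregation step follows; the function name, the mean-pooling choice, and the frame spans are assumptions for illustration rather than the paper's stated procedure.

```python
import torch

def pool_frames_to_words(frame_feats: torch.Tensor,
                         word_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Hypothetical helper: average frame-level emotion features per word.

    frame_feats: (num_frames, emo_dim) emotion2vec features for one utterance.
    word_spans:  forced-alignment boundaries as (start_frame, end_frame) per word.
    Returns:     (num_words, emo_dim) word-level emotion features.
    """
    pooled = [frame_feats[start:end].mean(dim=0) for start, end in word_spans]
    return torch.stack(pooled)


# Toy example: 3 words covering 50 frames of 768-dim features.
feats = torch.randn(50, 768)
spans = [(0, 14), (14, 33), (33, 50)]
word_feats = pool_frames_to_words(feats, spans)   # shape (3, 768)
```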
Problem

Research questions and friction points this paper is trying to address.

Capturing dynamic emotional shifts within sentences for speech synthesis
Enabling word-level emotion control instead of sentence-level modulation
Overcoming limitations of predefined labels for fine-grained emotional expression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Word-level emotion control via FiLM modulation
Frame-to-word emotion alignment using emotion2vec
Fine-grained emotional dynamics dataset construction
Sirui Wang
Meituan
Andong Chen
Harbin Institute of Technology, Harbin, China
Tiejun Zhao
Harbin Institute of Technology, Harbin, China