🤖 AI Summary
Existing emotional TTS systems predominantly rely on sentence-level controls, such as discrete labels, reference audio, or textual prompts, making them inadequate for modeling dynamic emotional variation within a sentence. To address this, we propose Emo-FiLM, the first framework to enable word-level fine-grained emotional speech synthesis. It employs emotion2vec to extract frame-level affective features, aligns them to the word level via forced alignment, and introduces Feature-wise Linear Modulation (FiLM) layers that dynamically modulate word embeddings, yielding emotion-aware text representations. The model is built on an LLM-based TTS architecture and trained end to end. To enable rigorous fine-grained evaluation, we also introduce FEDD, the first emotional speech dataset annotated with word-level emotion transitions. Experiments show that Emo-FiLM significantly outperforms state-of-the-art methods in both global emotional expressiveness and word-level emotion-transition fidelity, improving the naturalness, accuracy, and expressivity of synthesized speech.
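The frame-to-word alignment step described above can be sketched as simple pooling over forced-alignment spans. This is a minimal illustration, not the paper's implementation: the function name, the use of mean pooling, and the span format are all assumptions.

```python
import numpy as np

def pool_frames_to_words(frame_feats, word_spans):
    """Aggregate frame-level emotion features (e.g. from emotion2vec)
    into one vector per word, using forced-alignment frame spans.

    frame_feats: (T, D) array of per-frame affective features.
    word_spans: list of (start_frame, end_frame) pairs, one per word.
    Mean pooling is an illustrative choice, not the paper's stated method.
    """
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in word_spans])

# Toy example: 6 frames of 2-dim features, aligned to two words.
feats = np.arange(12, dtype=float).reshape(6, 2)
spans = [(0, 2), (2, 6)]
word_feats = pool_frames_to_words(feats, spans)  # shape (2, 2)
```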
📝 Abstract
Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
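The FiLM conditioning described in the abstract scales and shifts each text-embedding dimension using parameters predicted from the word-level emotion feature. Below is a minimal NumPy sketch of that idea; the dimensions, the linear projection `W`, the residual-style `1 + gamma` formulation, and all names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emo, d_txt = 4, 8  # illustrative feature sizes

# Hypothetical linear projection from an emotion vector to (gamma, beta).
W = rng.normal(size=(d_emo, 2 * d_txt)) * 0.1

def film_modulate(word_emb, emo_vec):
    """Feature-wise Linear Modulation of one word embedding:
    out = (1 + gamma) * emb + beta, with gamma/beta predicted
    from the word-level emotion vector."""
    gamma_beta = emo_vec @ W
    gamma, beta = gamma_beta[:d_txt], gamma_beta[d_txt:]
    return (1.0 + gamma) * word_emb + beta

emb = rng.normal(size=d_txt)   # one word's text embedding
emo = rng.normal(size=d_emo)   # its aligned emotion feature
out = film_modulate(emb, emo)  # emotion-modulated embedding, shape (8,)
```

With this residual formulation, an all-zero emotion vector leaves the word embedding unchanged, which is a common design choice so that conditioning starts from an identity mapping.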