Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement

๐Ÿ“… 2025-10-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing emotional TTS and voice conversion methods rely on reference encoders to extract global style vectors, limiting their ability to model fine-grained acoustic characteristics in reference speech and resulting in severe entanglement between emotion and timbre. To address this, we propose a mutual-information-guided phoneme-level emotionโ€“timbre disentanglement framework. Our method employs a dual-branch feature extractor to separately model emotion and timbre representations, and enforces explicit disentanglement via mutual information minimization constraints. Crucially, we introduce the first phoneme-level emotion embedding prediction mechanism, significantly improving dynamic emotion modeling fidelity. Experiments demonstrate that our approach substantially outperforms baseline systems, achieving gains of +0.42 in naturalness (MOS) and +12.6% in emotion expressiveness (EMOS). Moreover, it enhances the capability to capture subtle acoustic features from reference speech and improves generation flexibility.

Technology Category

Application Category

๐Ÿ“ Abstract
Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
Problem

Research questions and friction points this paper is trying to address.

Disentangles emotion and timbre features from reference speech
Enables fine-grained phoneme-level emotion embedding prediction
Improves emotional TTS quality by separating distinct style components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mutual information guides emotion-timbre disentanglement
Fine-grained phoneme-level emotion embedding prediction
Style disentanglement separates distinct reference speech components
๐Ÿ”Ž Similar Papers
No similar papers found.