Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the problem of unnatural speech and limited fine-grained emotional control in text-to-speech (TTS) caused by conflicts between textual semantics and emotional style prompts, this paper proposes a large language model (LLM)-driven adaptive classifier-free guidance (CFG) method. Our approach jointly leverages an LLM and a natural language inference (NLI) model to assess the semantic consistency between input text and style prompts, dynamically modulating the CFG scale accordingly to achieve an optimal trade-off between emotional expressiveness and speech quality. Experimental results demonstrate that the proposed method significantly enhances emotional fidelity while preserving high audio fidelity and intelligibility—outperforming both fixed-scale CFG and baseline models. This work establishes a novel paradigm for controllable emotional synthesis in autoregressive TTS systems.

Technology Category

Application Category

📝 Abstract
While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
Problem

Research questions and friction points this paper is trying to address.

Addresses style-content mismatch in auto-regressive TTS models
Proposes adaptive CFG to handle emotion-text conflicts
Maintains audio quality while improving emotional expressiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive CFG adjusts to style-content mismatch levels
Uses LLMs to detect emotion-text conflict
Maintains audio quality while enhancing expressiveness
🔎 Similar Papers
No similar papers found.
Y
Yizhou Peng
Alibaba-NTU Global e-Sustainability CorpLab, Nanyang Technological University, Singapore
Yukun Ma
Yukun Ma
Alibaba Group
ASRSLU
C
Chong Zhang
Alibaba, Alibaba Inc., Singapore
Y
Yi-Wen Chao
College of Computing and Data Science, Nanyang Technological University, Singapore
C
Chongjia Ni
Alibaba, Alibaba Inc., Singapore
B
Bin Ma
Alibaba, Alibaba Inc., Singapore