Bridging the gap between training and inference in LM-based TTS models

📅 2025-09-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address exposure bias arising from the train-inference discrepancy between teacher-forcing training and autoregressive inference in language model–based text-to-speech (TTS) systems, this paper proposes a prompt-guided hybrid training framework. The method jointly incorporates teacher-forcing and autoregressive generation, integrates self-generated tokens into training, and introduces a dynamic end-of-sequence (EOS) prediction mechanism to explicitly model termination conditions during synthesis—thereby substantially mitigating train-inference mismatch. Experiments demonstrate that the proposed approach significantly improves stability, naturalness, and sentence-level fluency in long-form speech synthesis, outperforming baseline models across multiple objective and subjective metrics. Notably, this work constitutes the first systematic investigation of exposure bias in LM-based TTS and provides a scalable, architecture-agnostic solution grounded in principled sequence modeling.

Technology Category

Application Category

📝 Abstract
Recent advancements in text-to-speech (TTS) have shown that language model (LM) based systems offer competitive performance compared to traditional approaches. However, in training, TTS models use ground-truth (GT) tokens as prefixes to predict the next token, while in inference these tokens are not available, a gap between training and inference that is often neglected. In this study, we propose a prompt-guided hybrid training scheme to mitigate exposure bias in popular LM-based TTS systems. Our core idea is to adopt a hybrid training paradigm that combines teacher forcing with free running, thereby introducing self-generated tokens into the training process. This makes the training mode more consistent with inference, reducing the training-inference gap. In addition, we incorporate an EOS prediction mechanism during training to detect incorrect sequence termination and adaptively control the free running process. Experimental results provide a comprehensive evaluation of the impact of exposure bias on LM-based TTS, and demonstrate that our method effectively narrows the training-inference gap, thereby improving the quality of synthesized long-form speech.
Problem

Research questions and friction points this paper is trying to address.

Addressing exposure bias in LM-based TTS systems
Bridging training-inference gap in text-to-speech models
Improving synthesized speech quality for long-form content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid training combining teacher forcing and free running
Introducing self-generated tokens during training process
EOS prediction mechanism to control sequence termination
🔎 Similar Papers
No similar papers found.
R
Ruonan Zhang
Tsinghua University
Lingzhou Mu
Lingzhou Mu
清华大学
AIGCvideo generationAI security
Xixin Wu
Xixin Wu
The Chinese University of Hong Kong
K
Kai Zhang
Tsinghua University