Bridging the gap between training and inference in LM-based TTS models

📅 2025-09-21

📈 Citations: 0

✨ Influential: 0

career value

174K/year

🤖 AI Summary

To address exposure bias arising from the train-inference discrepancy between teacher-forcing training and autoregressive inference in language model–based text-to-speech (TTS) systems, this paper proposes a prompt-guided hybrid training framework. The method jointly incorporates teacher-forcing and autoregressive generation, integrates self-generated tokens into training, and introduces a dynamic end-of-sequence (EOS) prediction mechanism to explicitly model termination conditions during synthesis—thereby substantially mitigating train-inference mismatch. Experiments demonstrate that the proposed approach significantly improves stability, naturalness, and sentence-level fluency in long-form speech synthesis, outperforming baseline models across multiple objective and subjective metrics. Notably, this work constitutes the first systematic investigation of exposure bias in LM-based TTS and provides a scalable, architecture-agnostic solution grounded in principled sequence modeling.

Technology Category

Application Category

📝 Abstract

Recent advancements in text-to-speech (TTS) have shown that language model (LM) based systems offer competitive performance compared to traditional approaches. However, in training, TTS models use ground-truth (GT) tokens as prefixes to predict the next token, while in inference these tokens are not available, a gap between training and inference that is often neglected. In this study, we propose a prompt-guided hybrid training scheme to mitigate exposure bias in popular LM-based TTS systems. Our core idea is to adopt a hybrid training paradigm that combines teacher forcing with free running, thereby introducing self-generated tokens into the training process. This makes the training mode more consistent with inference, reducing the training-inference gap. In addition, we incorporate an EOS prediction mechanism during training to detect incorrect sequence termination and adaptively control the free running process. Experimental results provide a comprehensive evaluation of the impact of exposure bias on LM-based TTS, and demonstrate that our method effectively narrows the training-inference gap, thereby improving the quality of synthesized long-form speech.

Problem

Research questions and friction points this paper is trying to address.

Addressing exposure bias in LM-based TTS systems

Bridging training-inference gap in text-to-speech models

Improving synthesized speech quality for long-form content

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid training combining teacher forcing and free running

Introducing self-generated tokens during training process

EOS prediction mechanism to control sequence termination

🔎 Similar Papers

No similar papers found.