VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

πŸ“… 2024-01-25
πŸ›οΈ IEEE International Conference on Acoustics, Speech, and Signal Processing
πŸ“ˆ Citations: 23
✨ Influential: 0
πŸ€– AI Summary
Decoder-only Transformer TTS models (e.g., VALL-E, SPEAR-TTS) show strong zero-shot adaptation given a speech prompt, but they suffer from mispronunciation, word skipping, and repetition because they impose no monotonic alignment constraint between text and speech. To address this, the authors propose VALL-T, presented as the first decoder-only TTS model built on generative transducer principles: shifting relative position embeddings over the input phoneme sequence explicitly model phoneme-level monotonic alignment and implicitly regularize the autoregressive generation process. Crucially, this design keeps the decoder-only architecture and its prompt-based zero-shot capability intact while improving decoding controllability and robustness. Experiments show a 28.3% relative reduction in word error rate and fewer hallucination phenomena across diverse synthesis conditions.

πŸ“ Abstract
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping and repeating. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the architecture of decoder-only Transformer. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates better robustness against hallucinations with a relative reduction of 28.3% in the word error rate.
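The core mechanism named in the abstract, shifting relative position embeddings over the input phoneme sequence, can be made concrete with a minimal sketch. This is not the authors' code; it only illustrates, under the assumption of a single alignment pointer, how phoneme position indices shift as generation advances.

```python
# Minimal sketch (not the authors' implementation) of shifting relative
# position indices: each phoneme is indexed relative to the phoneme the
# model is currently aligned to, and the whole index vector shifts by one
# whenever the alignment advances.

def relative_positions(num_phonemes: int, aligned_index: int) -> list[int]:
    """Relative position index per phoneme: 0 for the currently aligned
    phoneme, negative for already-covered ones, positive for upcoming ones."""
    return [i - aligned_index for i in range(num_phonemes)]


# Example: 5 phonemes, currently generating speech for the 3rd one (index 2).
print(relative_positions(5, 2))  # [-2, -1, 0, 1, 2]
# After the alignment advances to the 4th phoneme, every index shifts by one.
print(relative_positions(5, 3))  # [-3, -2, -1, 0, 1]
```

In the paper's framing, indices like these are what the decoder-only Transformer sees as relative position embeddings for the phoneme prompt, which is how the monotonic generation order is signalled without changing the architecture itself.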
Problem

Research questions and friction points this paper is trying to address.

Decoder-only TTS models (VALL-E, SPEAR-TTS) impose no monotonic alignment constraint between text and speech
The missing constraint causes hallucinations: mispronunciation, word skipping, and word repetition
Robust, controllable decoding is needed without giving up prompt-based zero-shot adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only Transformer with shifting relative position embeddings over the input phoneme sequence
Transducer-style monotonic alignment that suppresses mispronunciation, skipping, and repetition (see the decoding sketch below)
Retains prompt-based zero-shot adaptation while cutting word error rate by 28.3% relative
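To illustrate the transducer principle named in the title, here is a hedged sketch of a greedy, monotonic decoding loop. `model.step`, `BLANK`, and the token interface are hypothetical placeholders, not VALL-T's actual API; the point is only that the alignment pointer over the phonemes can move forward one step at a time and never backward.

```python
# Hedged sketch of transducer-style monotonic decoding with a decoder-only
# model. `model.step` and `BLANK` are hypothetical names for illustration.

BLANK = -1  # placeholder id for the blank token that advances the alignment

def greedy_transducer_decode(model, phonemes, max_steps=2000):
    aligned = 0           # index of the phoneme currently being generated
    acoustic_tokens = []  # generated speech (acoustic) tokens
    for _ in range(max_steps):
        if aligned >= len(phonemes):
            break  # every phoneme has been covered, so stop
        # The model conditions on the phonemes with positions shifted so that
        # `aligned` sits at relative position 0 (see the sketch above).
        token = model.step(phonemes, aligned, acoustic_tokens)
        if token == BLANK:
            aligned += 1                   # advance monotonically to the next phoneme
        else:
            acoustic_tokens.append(token)  # emit another acoustic token for it
    return acoustic_tokens
```

Because the loop can only emit acoustic tokens for the current phoneme or advance to the next one, the alignment cannot drift backward or jump ahead, which is how the monotonic constraint is meant to improve robustness against hallucination.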
πŸ‘₯ Authors
Chenpeng Du (ByteDance)
Yiwei Guo (MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University)
Yifan Yang (MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University)
Zhikang Niu (Shanghai Jiao Tong University)
Shuai Wang (Shenzhen Research Institute of Big Data)
Hui Zhang (AISpeech Ltd.)
Xie Chen (MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University)
Kai Yu (MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University)