🤖 AI Summary
Zero-shot text-to-speech (TTS) faces an inherent trade-off: autoregressive (AR) models suffer from slow inference and limited duration control, while non-autoregressive (NAR) models exhibit weak temporal modeling and rely on complex architectural designs. To resolve this, we propose pseudo-autoregressive (PAR) codec language modeling—a paradigm unifying AR's strong sequential modeling with NAR's parallel efficiency. Our two-stage TTS system, PALLE, first employs PAR to generate variable-length speech-token spans, committing only the left-most span at each step, then refines low-confidence tokens iteratively with confidence-guided NAR decoding. Integrating neural audio codecs with conditional sequence modeling, PALLE enables global-context-aware stepwise generation and refinement. On LibriSpeech test-clean, PALLE—trained only on LibriTTS—surpasses systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, in speech quality, speaker similarity, and intelligibility, with up to 10× faster inference.
📝 Abstract
Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://anonymous-palle.github.io.
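The two-stage decoding described above can be illustrated with a minimal, hypothetical sketch: stage 1 predicts all unfilled positions in parallel at each step but commits only the left-most span, and stage 2 iteratively re-predicts the lowest-confidence tokens using global context. The `mock_model` scorer, the fixed span length, and the refinement fraction are all stand-in assumptions for illustration — the actual PALLE model predicts dynamic-length spans and uses a trained codec language model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mock_model(tokens, vocab=32):
    """Hypothetical stand-in for the codec language model:
    returns per-position logits over the token vocabulary."""
    return rng.normal(size=(len(tokens), vocab))

def par_generate(total_len, span=4, vocab=32):
    """Stage 1 (PAR): each step predicts ALL positions in parallel,
    but only the left-most `span` tokens are retained."""
    tokens = np.full(total_len, -1)              # -1 marks unfilled
    cursor = 0
    while cursor < total_len:
        logits = mock_model(tokens, vocab)       # parallel prediction
        preds = logits.argmax(axis=-1)
        end = min(cursor + span, total_len)
        tokens[cursor:end] = preds[cursor:end]   # keep left-most span only
        cursor = end
    return tokens

def nar_refine(tokens, steps=3, refine_frac=0.25, vocab=32):
    """Stage 2 (NAR): iteratively re-predict the lowest-confidence
    tokens in parallel, leveraging global context."""
    tokens = tokens.copy()
    for _ in range(steps):
        logits = mock_model(tokens, vocab)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs.max(axis=-1)                # per-position confidence
        k = max(1, int(refine_frac * len(tokens)))
        low = np.argsort(conf)[:k]               # least-confident positions
        tokens[low] = logits.argmax(axis=-1)[low]
    return tokens

draft = par_generate(total_len=16)
final = nar_refine(draft)
```

Compared to a pure AR decoder (one token per step) or a pure NAR decoder (all tokens at once), this loop advances a cursor left-to-right in fixed time steps while still producing each span with a single parallel prediction.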