Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

📅 2025-09-26
📈 Citations: 0 · Influential: 0
📄 PDF
🤖 AI Summary
Autoregressive (AR) LLM-based text-to-speech (TTS) systems face three key challenges: (1) single-codebook acoustic modeling incurs significant speech information loss; (2) hierarchical Residual Vector Quantization (RVQ) tokens lack explicit semantic structure, increasing modeling complexity; and (3) error accumulation in AR generation degrades synthesis stability. To address these, we propose CaT-TTS, a novel “understand-then-generate” dual language modeling paradigm. It introduces S3Codec, a semantics-aware split-RVQ codec that injects high-level linguistic structure into its primary codebook via semantic distillation. A dual-Transformer architecture explicitly decouples speech understanding from waveform generation, and Masked Audio Parallel Inference (MAPI) guides decoding in parallel to suppress error propagation. Experiments on zero-shot TTS demonstrate that CaT-TTS achieves superior semantic alignment, synthesis stability, and audio fidelity compared with state-of-the-art AR approaches.
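The MAPI step in the summary can be pictured as iterative mask-and-refill decoding. Below is a minimal PyTorch sketch of one plausible masked-parallel scheme; the `model` call signature, the cosine unmasking schedule, and all names are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of masked-parallel decoding in the spirit of MAPI.
# `model`, its call signature, and the cosine unmasking schedule are
# illustrative assumptions, not the paper's actual algorithm.
import math
import torch

@torch.no_grad()
def masked_parallel_decode(model, text_tokens, seq_len, mask_id, num_steps=8):
    """Iteratively refine a fully masked token sequence in parallel.

    Every step re-predicts all positions at once, keeps only the most
    confident predictions, and re-masks the rest, so an early local error
    can be overwritten later instead of propagating as in AR decoding.
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(text_tokens, tokens)            # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Cosine schedule: keep progressively more positions each step.
        keep = max(1, int((1.0 - math.cos(math.pi / 2 * (step + 1) / num_steps)) * seq_len))
        kept = conf[0].topk(keep).indices
        tokens = torch.full_like(tokens, mask_id)
        tokens[0, kept] = pred[0, kept]
    return tokens  # fully unmasked at the final step
```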

📝 Abstract
Existing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems, while achieving state-of-the-art quality, still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the continuous speech waveform into a sequence of discrete tokens by a neural audio codec. However, single-codebook modeling, though well suited to text LLMs, suffers from significant information loss, while hierarchical acoustic tokens, typically generated via Residual Vector Quantization (RVQ), often lack explicit semantic structure, placing a heavy learning burden on the model. Furthermore, the autoregressive process is inherently susceptible to error accumulation, which can degrade generation stability. To address these limitations, we propose CaT-TTS, a novel framework for robust and semantically grounded zero-shot synthesis. First, we introduce S3Codec, a split RVQ codec that injects explicit linguistic features into its primary codebook via semantic distillation from a state-of-the-art ASR model, providing a structured representation that simplifies the learning task. Second, we propose an "Understand-then-Generate" dual-Transformer architecture that decouples comprehension from rendering. An initial "Understanding" Transformer models the cross-modal relationship between text and the audio's semantic tokens to form a high-level utterance plan. A subsequent "Generation" Transformer then executes this plan, autoregressively synthesizing hierarchical acoustic tokens. Finally, to enhance generation stability, we introduce Masked Audio Parallel Inference (MAPI), a nearly parameter-free inference strategy that dynamically guides the decoding process to mitigate local errors.
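To make the dual-Transformer split concrete, here is a minimal PyTorch sketch of the data flow the abstract describes: an "Understanding" stack consumes text plus semantic tokens to form a plan, and a "Generation" stack conditions on that plan to predict acoustic tokens. Module names, sizes, and the use of plain encoder layers (causal masking omitted) are simplifying assumptions.

```python
# Minimal sketch of the "understand-then-generate" dual-Transformer flow.
# All module names, vocab sizes, and dimensions are hypothetical.
import torch
import torch.nn as nn

class DualLMSketch(nn.Module):
    def __init__(self, text_vocab=256, sem_vocab=1024, ac_vocab=1024, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.sem_emb = nn.Embedding(sem_vocab, dim)
        self.ac_emb = nn.Embedding(ac_vocab, dim)
        self.understand = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=6)
        self.generate = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=6)
        self.head = nn.Linear(dim, ac_vocab)

    def forward(self, text_ids, sem_ids, ac_ids):
        # Stage 1 ("Understanding"): model the cross-modal relationship
        # between text and semantic tokens to form an utterance plan.
        plan = self.understand(
            torch.cat([self.text_emb(text_ids), self.sem_emb(sem_ids)], dim=1))
        # Stage 2 ("Generation"): condition on the plan and predict the
        # hierarchical acoustic tokens (causal masking omitted for brevity).
        h = self.generate(torch.cat([plan, self.ac_emb(ac_ids)], dim=1))
        return self.head(h[:, plan.size(1):])  # logits over acoustic vocab
```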
Problem

Research questions and friction points this paper is trying to address.

Information loss from single-codebook speech discretization
Entangled semantic comprehension and acoustic generation
Error accumulation in autoregressive speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic distillation enhances codebook linguistic structure (see the sketch after this list)
Dual-Transformer decouples comprehension from audio rendering
Masked parallel inference mitigates autoregressive error accumulation
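The semantic-distillation idea in the first bullet can be sketched as a feature-matching loss that pulls S3Codec's primary-codebook embeddings toward time-aligned features from a frozen ASR encoder. The projection, shapes, and cosine objective below are illustrative assumptions; the paper's exact target and loss may differ.

```python
# Minimal sketch of a semantic-distillation objective for the primary
# RVQ codebook. Shapes, the learned projection, and the cosine loss
# are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_distill_loss(primary_embeds, asr_features, proj):
    """primary_embeds: (B, T, D)     quantized output of codebook 1
    asr_features:   (B, T, D_asr)  frozen ASR encoder features
    proj:           nn.Linear(D, D_asr), trained jointly with the codec
    """
    pred = proj(primary_embeds)
    # 1 - cosine similarity per frame, averaged over batch and time.
    return (1.0 - F.cosine_similarity(pred, asr_features, dim=-1)).mean()

# e.g., total codec loss = reconstruction + commitment
#       + lambda_sem * semantic_distill_loss(...)
```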
👥 Authors
Junjie Cao, School of Mathematical Sciences, Dalian University of Technology (Computer Graphics, Computer Vision, Machine Learning)
Yichen Han, AMAP Speech
Ruonan Zhang, Tsinghua University
Xiaoyang Hao, Tencent (speech synthesis)
Hongxiang Li, Tsinghua University
Shuaijiang Zhao, KE, DIDI, BAIDU, PKU (Speech LLM)
Yue Liu, AMAP Speech
Xiao-Ping Zhang, Tsinghua University