Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot text-to-speech (TTS) models struggle to capture the complex coupling between acoustic and semantic features, resulting in limited expressiveness and low speaker similarity. To address this, we propose a novel autoregressive–non-autoregressive collaborative zero-shot TTS framework. Our method introduces a Parallel Tokenizer to jointly generate discrete semantic and acoustic tokens; designs a coupled non-autoregressive decoder to explicitly model their interdependence; and incorporates a cross-modal feature alignment mechanism for hierarchical fusion. Built upon a large language model architecture, the framework balances modeling capacity with inference efficiency. Experiments on Chinese and English datasets demonstrate significant improvements over state-of-the-art methods: higher naturalness and speaker similarity, along with faster synthesis speed.

📝 Abstract
Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models face challenges in capturing the complex correlations between acoustic and semantic features, resulting in a lack of expressiveness and speaker similarity. The primary reason is that the relationship between semantic and acoustic features has both independent and interdependent aspects. This paper introduces a TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. The Coupled NAR model, which accounts for the interdependence, then predicts the detailed tokens based on the AR model's general output. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in both synthesis quality and efficiency. Speech demos are available at https://t1235-ch.github.io/pgpt/.
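The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`ar_generate`, `nar_refine`), the toy stand-in predictors, the end-of-sequence convention, and the layer-by-layer refinement order are all assumptions made for illustration. It only shows the general shape of the idea: an AR loop that emits one top-level (semantic, acoustic) token pair per step, followed by a NAR pass that fills in detail tokens for all frames at once while conditioning on both streams.

```python
from typing import Callable, List, Tuple

# One frame of top-level output: (semantic token, acoustic token).
TokenPair = Tuple[int, int]

EOS = 0  # hypothetical end-of-sequence token id


def ar_generate(
    text_ids: List[int],
    step_fn: Callable[[List[int], List[TokenPair]], TokenPair],
    max_frames: int,
) -> List[TokenPair]:
    """AR stage: emit one (semantic, acoustic) pair per step.

    Both tokens of a frame come from the same step, so neither stream
    waits for the other to finish.
    """
    frames: List[TokenPair] = []
    for _ in range(max_frames):
        pair = step_fn(text_ids, frames)
        if pair[0] == EOS:
            break
        frames.append(pair)
    return frames


def nar_refine(
    frames: List[TokenPair],
    layer_fn: Callable[[List[TokenPair], List[List[int]], int], List[int]],
    num_detail_layers: int,
) -> List[List[int]]:
    """NAR stage: predict each detail layer for every frame in one call.

    Each layer sees both top-level streams plus the detail layers
    already produced (an assumed ordering, not taken from the paper).
    """
    layers: List[List[int]] = []
    for layer in range(num_detail_layers):
        layers.append(layer_fn(frames, layers, layer))
    return layers


# Toy stand-ins for the real networks (purely illustrative).
def toy_step(text_ids: List[int], frames: List[TokenPair]) -> TokenPair:
    t = len(frames)
    if t >= 2 * len(text_ids):  # stop after two frames per text token
        return (EOS, EOS)
    semantic = 1 + (text_ids[t // 2] + t) % 99
    acoustic = 1 + (semantic * 7 + t) % 99
    return (semantic, acoustic)


def toy_layer(
    frames: List[TokenPair], layers: List[List[int]], layer: int
) -> List[int]:
    # Depends on both streams, mimicking the coupling between them.
    return [(s + a + layer) % 100 for s, a in frames]


frames = ar_generate([5, 6, 7], toy_step, max_frames=50)
detail = nar_refine(frames, toy_layer, num_detail_layers=3)
```

In this sketch the sequential cost grows only with the number of frames in the AR stage, while the detail layers add a fixed number of parallel passes, which is one plausible reading of why an AR plus NAR split can help inference efficiency.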
Problem

Research questions and friction points this paper is trying to address.

Balancing the independence and interdependence of acoustic and semantic features in TTS
Improving expressiveness and similarity in zero-shot TTS models
Harmonizing AR and NAR modules for better speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines AR and NAR modules for TTS
Uses Parallel Tokenizer for semantic and acoustic tokens
Improves zero-shot TTS with parallel structure
Jingyuan Xing
School of Future Technology, South China University of Technology, Guangzhou 510640, China
Zhipeng Li
National Institute of Standards and Technology, USA
TEM, solid oxide fuel cell, lithium ion battery, thin film, semiconductor
Jialong Mai
School of Electronic and Information, South China University of Technology, Guangzhou 510640, China
Xiaofen Xing
South China University of Technology
Xiangmin Xu
South China University of Technology