🤖 AI Summary
Existing zero-shot autoregressive text-to-speech (TTS) systems face two key challenges: (1) a speed–quality trade-off, since lowering the token frame rate degrades expressiveness while increasing token density harms inference efficiency; and (2) a supervision mismatch, since cross-entropy loss ignores the acoustic similarity among neighboring speech tokens. To address these, the authors propose BridgeTTS, an AR-TTS framework built on BridgeCode, a dual speech representation paradigm: it predicts sparse discrete tokens for efficient autoregressive generation while jointly reconstructing rich continuous acoustic features to preserve speech quality. A two-level joint optimization objective, operating at both the token and feature levels, supplies acoustically aware supervision. Experiments demonstrate that BridgeTTS achieves competitive naturalness and speaker similarity while significantly accelerating synthesis, jointly addressing efficiency, quality, and the supervision mismatch.
📝 Abstract
Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed–quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
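The two-level objective described above can be sketched minimally: a token-level cross-entropy term on the sparse discrete tokens plus a feature-level regression term on the reconstructed continuous acoustic features. The function name, shapes, L1 choice, and weighting below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def joint_loss(token_logits, token_targets, feat_pred, feat_target, lam=1.0):
    """Hypothetical two-level joint objective (names and weighting assumed).

    token_logits: (T, V) unnormalized scores over a V-way token vocabulary
    token_targets: (T,) integer token indices
    feat_pred / feat_target: (T, D) continuous acoustic features
    """
    # Token level: cross-entropy penalizes all wrong tokens equally,
    # regardless of how acoustically close they are to the target.
    z = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(token_targets)), token_targets].mean()
    # Feature level: an L1 regression term is sensitive to acoustic
    # proximity, scaling the penalty with distance from the target features.
    l1 = np.abs(feat_pred - feat_target).mean()
    return ce + lam * l1

# Toy usage with random arrays (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
T, V, D = 10, 256, 80
loss = joint_loss(rng.normal(size=(T, V)),
                  rng.integers(0, V, size=T),
                  rng.normal(size=(T, D)),
                  rng.normal(size=(T, D)))
```

The feature-level term is what distinguishes this from plain token supervision: two predictions that miss the target token receive identical cross-entropy penalties, but different feature losses depending on acoustic distance.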