🤖 AI Summary
Existing zero-shot autoregressive text-to-speech (TTS) systems face two key challenges: (1) a speed–quality trade-off, since lowering the token frame rate degrades expressiveness while increasing token density harms inference efficiency; and (2) a supervision mismatch, since cross-entropy loss ignores the acoustic similarity among neighboring speech tokens. To address these, the authors propose BridgeTTS, an AR-TTS framework built on BridgeCode, a dual speech representation paradigm: it predicts sparse discrete tokens for efficient autoregressive generation while jointly reconstructing rich continuous acoustic features to preserve speech quality. A two-level joint optimization objective, operating at both the token and feature levels, supplies acoustically aware supervision. Experiments demonstrate that BridgeTTS achieves competitive naturalness and speaker similarity while significantly accelerating synthesis, jointly addressing efficiency, quality, and the supervision mismatch.
📝 Abstract
Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed–quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
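The two-level objective described above can be sketched minimally: a token-level cross-entropy term on the sparse discrete tokens plus a feature-level regression term on the reconstructed continuous acoustic features. The function name, shapes, L1 choice, and weighting below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def joint_loss(token_logits, token_targets, feat_pred, feat_target, lam=1.0):
    """Hypothetical two-level joint objective (names and weighting assumed).

    token_logits: (T, V) unnormalized scores over a V-way token vocabulary
    token_targets: (T,) integer token indices
    feat_pred / feat_target: (T, D) continuous acoustic features
    """
    # Token level: cross-entropy penalizes all wrong tokens equally,
    # regardless of how acoustically close they are to the target.
    z = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(token_targets)), token_targets].mean()
    # Feature level: an L1 regression term is sensitive to acoustic
    # proximity, scaling the penalty with distance from the target features.
    l1 = np.abs(feat_pred - feat_target).mean()
    return ce + lam * l1

# Toy usage with random arrays (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
T, V, D = 10, 256, 80
loss = joint_loss(rng.normal(size=(T, V)),
                  rng.integers(0, V, size=T),
                  rng.normal(size=(T, D)),
                  rng.normal(size=(T, D)))
```

The feature-level term is what distinguishes this from plain token supervision: two predictions that miss the target token receive identical cross-entropy penalties, but different feature losses depending on acoustic distance.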