🤖 AI Summary
SyncSpeech addresses the challenge of low-latency speech synthesis under streaming text input by proposing the first end-to-end TTS framework enabling synchronous, dual-stream generation of text and speech. Methodologically, it introduces: (1) a temporal masked Transformer architecture for incremental modeling; (2) a token-level joint duration prediction mechanism to ensure prosodic accuracy; and (3) asynchronous dual-stream encoding-decoding coupled with a two-stage training strategy to jointly optimize real-time performance and audio quality. Evaluated on English and Mandarin datasets, SyncSpeech achieves ultra-low latency—initiating synthesis upon arrival of the *second* text token—with a significantly improved real-time factor. Its speech quality and robustness match those of autoregressive TTS models trained at the same data scale. This work provides an efficient, low-latency streaming TTS solution tailored for large-model–driven voice interaction systems.
📝 Abstract
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech, facilitating seamless interaction with large language models. SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each arriving text token in one step. To achieve this, we propose a temporal masked transformer as the backbone of SyncSpeech, combined with token-level duration prediction to predict the speech tokens and the duration for the next step. Additionally, we design a two-stage training strategy to improve training efficiency and the quality of the generated speech. We evaluated SyncSpeech on both English and Mandarin datasets. Compared to recent dual-stream TTS models, SyncSpeech significantly reduces the first-packet delay of speech tokens and accelerates the real-time factor. Moreover, at the same data scale, SyncSpeech achieves performance comparable to that of traditional autoregressive TTS models in terms of both speech quality and robustness. Speech samples are available at https://SyncSpeech.github.io/.
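To make the dual-stream decoding loop concrete, here is a minimal sketch of the control flow the abstract describes: synthesis starts once the second text token arrives, a token-level duration predictor decides how many speech tokens the current step should emit, and all of them are decoded in a single step. This is an illustrative reconstruction, not the paper's implementation; `ToyModel`, `predict_duration`, and `decode_speech_tokens` are hypothetical stand-ins for the temporal masked transformer.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for the temporal masked transformer.

    For illustration it assigns a fixed duration of 2 speech tokens per
    text token and emits deterministic token ids.
    """

    def predict_duration(self, text_ctx, speech_out):
        # Token-level duration prediction: number of speech tokens
        # to decode for the newest text token (toy constant here).
        return 2

    def decode_speech_tokens(self, text_ctx, speech_out, n):
        # One non-autoregressive step decodes all n tokens at once.
        start = len(speech_out)
        return [start + i for i in range(n)]


def stream_tts(text_tokens, model):
    """Consume streaming text tokens; yield speech-token chunks."""
    text_ctx, speech_out = [], []
    for tok in text_tokens:
        text_ctx.append(tok)
        if len(text_ctx) < 2:
            # Low latency: generation begins at the second text token.
            continue
        n = model.predict_duration(text_ctx, speech_out)
        chunk = model.decode_speech_tokens(text_ctx, speech_out, n)
        speech_out.extend(chunk)
        yield chunk  # streamed downstream to the speech decoder/vocoder
```

In a real system each yielded chunk would be vocoded and played back immediately, which is what keeps the first-packet delay low.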