SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System

📅 2025-03-29
🤖 AI Summary
To address the limited scalability, high computational cost, and reliance on grapheme-to-phoneme (G2P) conversion and external aligners in end-to-end TTS systems, this paper proposes a lightweight, fully character-level TTS framework. Methodologically, it (1) constructs a low-dimensional latent speech space using temporal compression and a lightweight ConvNeXt-based backbone; (2) employs flow matching for a stable, invertible text-to-latent mapping; and (3) introduces utterance-level duration prediction together with context-sharing batch expansion, which accelerates loss convergence and stabilizes text-speech alignment. The model eliminates G2P and external alignment modules, substantially reducing parameter count and memory footprint while improving inference speed. It achieves speech quality competitive with contemporary TTS models at markedly lower architectural complexity and computational overhead. Audio samples are publicly released.
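The flow-matching text-to-latent training described above can be sketched as follows. This is a generic conditional flow-matching objective with a linear interpolation path, not SupertonicTTS's actual implementation; the function and tensor names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """Generic conditional flow-matching loss (sketch, not the paper's code).

    x1:   target speech latent, shape (batch, channels, time)
    cond: text conditioning (e.g., character embeddings attended via
          cross-attention inside `model`), same batch dimension
    """
    x0 = torch.randn_like(x1)               # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1)        # random timestep in [0, 1]
    xt = (1 - t) * x0 + t * x1              # linear interpolation path
    v_target = x1 - x0                      # constant target velocity along path
    v_pred = model(xt, t, cond)             # model predicts the velocity field
    return F.mse_loss(v_pred, v_target)
```

At inference, one would integrate the learned velocity field from noise to a latent (e.g., with a few Euler steps) and decode it with the speech autoencoder.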

📝 Abstract
We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS is comprised of three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: https://supertonictts.github.io/.
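The lightweight ConvNeXt backbone mentioned in the abstract can be illustrated with a 1-D ConvNeXt-style block. This is a sketch of the standard block design (depthwise convolution, LayerNorm, pointwise MLP, residual connection) adapted to sequences; the paper's exact block configuration may differ.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """1-D ConvNeXt-style block (illustrative sketch, not the paper's exact block):
    depthwise conv -> LayerNorm -> pointwise MLP with GELU -> residual add."""

    def __init__(self, dim, hidden=None, kernel_size=7):
        super().__init__()
        hidden = hidden or 4 * dim
        # groups=dim makes the convolution depthwise (one filter per channel)
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.pw2 = nn.Linear(hidden, dim)

    def forward(self, x):                # x: (batch, dim, time)
        res = x
        x = self.dwconv(x)
        x = x.transpose(1, 2)            # (batch, time, dim) for norm/MLP
        x = self.pw2(self.act(self.pw1(self.norm(x))))
        return res + x.transpose(1, 2)   # back to (batch, dim, time)
```

Such blocks are attractive for a lightweight TTS backbone because depthwise convolutions keep the parameter count and FLOPs low relative to full convolutions or self-attention.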
Problem

Research questions and friction points this paper is trying to address.

Improves scalability and efficiency in text-to-speech synthesis
Eliminates need for G2P modules and external aligners
Reduces architectural complexity and computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-matching text-to-latent module
Raw character-level text processing
Context-sharing batch expansion
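One reading of context-sharing batch expansion, as described in the abstract, is that each text context is repeated within a batch so that several flow-matching samples (different noise draws and timesteps) share one context, which amortizes text encoding and stabilizes alignment. The sketch below implements that reading; the function name and the mechanism's exact details are assumptions, not the paper's code.

```python
import torch

def context_sharing_expand(text_emb, latent, k=4):
    """Sketch of context-sharing batch expansion (my reading of the abstract).

    Repeats each (text embedding, target latent) pair k times along the
    batch dimension, so k independently-noised flow-matching samples
    share a single text context during training.

    text_emb: (batch, text_len, dim)
    latent:   (batch, channels, time)
    """
    return (text_emb.repeat_interleave(k, dim=0),
            latent.repeat_interleave(k, dim=0))
```

Each repeated copy would then receive its own noise sample and timestep in the flow-matching loss, giving the model several gradient signals per text context per step.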