🤖 AI Summary
This work proposes a non-autoregressive, diffusion-based text-to-speech (TTS) model that synthesizes speech directly in the waveform latent space, bypassing intermediate acoustic representations such as mel-spectrograms and thereby avoiding cumulative errors and pipeline complexity. The approach requires only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone, and improves inference in two ways: it rectifies a long-standing train-inference mismatch and replaces conventional classifier-free guidance with an adaptive projection guidance mechanism. According to the authors, this is the first method to achieve high-quality TTS entirely within the waveform latent space. It sets a new state of the art in zero-shot voice cloning on the Seed benchmark: LongCat-AudioDiT-3.5B attains speaker-similarity scores of 0.818 on Seed-ZH and 0.797 on Seed-Hard while preserving high intelligibility.
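To make the two-component pipeline concrete, here is a minimal sketch of latent-space TTS as the summary describes it: a Wav-VAE maps waveforms to and from a latent sequence, and a diffusion backbone denoises latents conditioned on text, so the only post-processing is a single VAE decode. Every class, shape, and hyperparameter below (`WavVAE`, `DiffusionBackbone`, the hop size, the Euler sampler) is a hypothetical stand-in, not the paper's actual architecture or API.

```python
import torch
import torch.nn as nn

class WavVAE(nn.Module):
    """Toy stand-in for the paper's Wav-VAE: waveform <-> latent sequence."""
    def __init__(self, latent_dim=64, hop=320):
        super().__init__()
        self.enc = nn.Conv1d(1, latent_dim, hop, stride=hop)
        self.dec = nn.ConvTranspose1d(latent_dim, 1, hop, stride=hop)

    def encode(self, wav):  # (B, 1, T) -> (B, D, T // hop)
        return self.enc(wav)

    def decode(self, z):    # (B, D, T') -> (B, 1, T' * hop)
        return self.dec(z)

class DiffusionBackbone(nn.Module):
    """Toy denoiser: predicts a velocity for the latents given text features."""
    def __init__(self, latent_dim=64, text_dim=64):
        super().__init__()
        self.proj = nn.Conv1d(latent_dim + text_dim + 1, latent_dim, 1)

    def forward(self, z_t, t, text):
        # Broadcast the scalar timestep over the sequence and concat conditions.
        t_map = t.view(-1, 1, 1).expand(-1, 1, z_t.shape[-1])
        return self.proj(torch.cat([z_t, text, t_map], dim=1))

@torch.no_grad()
def synthesize(backbone, vae, text, latent_dim=64, steps=32):
    """Plain Euler sampling in the waveform latent space, then one VAE decode.
    No mel-spectrogram or separate vocoder stage is involved."""
    z = torch.randn(text.shape[0], latent_dim, text.shape[-1])
    for i in range(steps):
        t = torch.full((z.shape[0],), 1.0 - i / steps)
        v = backbone(z, t, text)   # predicted velocity at time t
        z = z - v / steps          # one Euler step toward t = 0
    return vae.decode(z)           # latents -> waveform directly

# Usage: 50 latent frames at hop 320 decode to a 16000-sample waveform.
vae, backbone = WavVAE(), DiffusionBackbone()
text = torch.randn(1, 64, 50)          # pretend text-aligned conditioning
wav = synthesize(backbone, vae, text)  # shape (1, 1, 16000)
```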
📝 Abstract
We present LongCat-AudioDiT, a novel non-autoregressive, diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to improve generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving speaker similarity (SIM) from 0.809 to 0.818 on Seed-ZH and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
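The abstract does not spell out the guidance formula, so for reference here is a minimal sketch contrasting conventional classifier-free guidance with a projection-based alternative. The decomposition below follows one published adaptive projected guidance formulation for image diffusion (splitting the guidance term into components parallel and orthogonal to the conditional prediction and down-weighting the parallel part); LongCat-AudioDiT's adaptive projection guidance may differ, and the `eta` parameter is an assumption from that formulation.

```python
import torch

def cfg(cond, uncond, w=3.0):
    """Conventional classifier-free guidance:
    uncond + w * (cond - uncond) == cond + (w - 1) * (cond - uncond)."""
    return uncond + w * (cond - uncond)

def adaptive_projected_guidance(cond, uncond, w=3.0, eta=0.0, eps=1e-8):
    """Sketch of a projection-based alternative to CFG. The guidance term
    (cond - uncond) is decomposed, per sample, into parts parallel and
    orthogonal to the conditional prediction; the orthogonal part is kept
    and the parallel part is rescaled by eta. Illustrative only; the
    paper's exact mechanism is not specified in the abstract."""
    diff = (cond - uncond).flatten(1)
    ref = cond.flatten(1)
    # Per-sample projection of the guidance term onto the conditional output.
    dot = (diff * ref).sum(dim=1, keepdim=True)
    norm_sq = (ref * ref).sum(dim=1, keepdim=True).clamp_min(eps)
    parallel = (dot / norm_sq) * ref
    orthogonal = diff - parallel
    guided = ref + (w - 1.0) * (orthogonal + eta * parallel)
    return guided.view_as(cond)

# Usage: with eta = 1.0 this reduces exactly to standard CFG.
cond, uncond = torch.randn(2, 64, 50), torch.randn(2, 64, 50)
assert torch.allclose(
    adaptive_projected_guidance(cond, uncond, w=3.0, eta=1.0),
    cfg(cond, uncond, w=3.0), atol=1e-5,
)
```

Setting `eta < 1` shrinks only the component of the guidance update that rescales the conditional prediction, which in the image-diffusion literature is reported to reduce the over-saturation artifacts of large guidance scales while keeping the directional benefit of guidance.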