DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address redundancy between semantic and acoustic tokens and the resulting inefficiency of token-based speech generation in non-streaming scenarios, this paper proposes DiffSoundStream. First, it conditions a neural codec on semantic tokens, so the acoustic tokens need not re-encode information the semantic tokens already carry. Second, it builds a semantic-conditioned latent diffusion model that reconstructs high-fidelity waveforms from semantic and coarse-level acoustic tokens in few sampling steps. The framework integrates self-supervised feature quantization, neural speech coding, semantic conditional modeling, and latent diffusion. Through step-size distillation, sampling is reduced to just four diffusion steps with only minor quality loss. Experiments show that at a low rate of 50 tokens per second, DiffSoundStream matches the synthesis quality of a standard SoundStream model operating at twice the token rate, substantially improving the efficiency-quality trade-off of non-streaming speech synthesis.

📝 Abstract
Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to as semantic tokens and acoustic tokens. These tokens are often modeled autoregressively, with the inference speed being constrained by the token rate. In this work, we propose DiffSoundStream, a solution that improves the efficiency of speech tokenization in non-streaming scenarios through two techniques: (1) conditioning the neural codec on semantic tokens to minimize redundancy between semantic and acoustic tokens, and (2) leveraging latent diffusion models to synthesize high-quality waveforms from semantic and coarse-level acoustic tokens. Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard SoundStream model operating at twice the token rate. Additionally, we achieve step-size distillation using just four diffusion sampling steps with only a minor quality loss.
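The 50 vs. 100 tokens-per-second comparison comes down to simple rate arithmetic: an autoregressive model's inference cost scales with the number of tokens it must predict per second of audio. A minimal sketch, using hypothetical frame rates and codebook counts (the paper's exact configuration is not stated here):

```python
def token_rate(frames_per_sec: int, num_codebooks: int) -> int:
    """Tokens/s for a codec emitting one token per codebook per frame."""
    return frames_per_sec * num_codebooks

# Hypothetical configurations: halving the residual-VQ depth at a fixed
# frame rate halves the token rate the language model must predict.
baseline = token_rate(25, 4)  # e.g., a standard SoundStream-style setup -> 100 tokens/s
reduced = token_rate(25, 2)   # semantic-conditioned codec, fewer levels  -> 50 tokens/s
```

Conditioning the codec on semantic tokens is what makes the lower-rate configuration viable: the discarded fine-level acoustic tokens are recovered by the diffusion decoder rather than transmitted.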
Problem

Research questions and friction points this paper is trying to address.

Improve speech tokenization efficiency in non-streaming scenarios
Reduce redundancy between semantic and acoustic tokens
Synthesize high-quality waveforms with fewer diffusion steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditioning neural codec on semantic tokens
Using latent diffusion models for waveform synthesis
Achieving high quality with fewer diffusion steps
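The few-step sampling idea can be illustrated with a toy deterministic (DDIM-style) sampler. This is a generic sketch of the technique only, with a stand-in linear "denoiser" in place of the paper's semantic-conditioned latent diffusion network; the schedule values and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, step, cond):
    # Stand-in for the latent diffusion model's noise predictor.
    # A real model would be a neural network conditioned on semantic
    # and coarse acoustic tokens (here: a fixed linear map).
    return 0.1 * x + 0.01 * cond

def ddim_sample(cond, steps=4, dim=8):
    """Deterministic DDIM-style sampling over a coarse few-step schedule."""
    alphas = np.linspace(0.999, 0.01, steps + 1)  # cumulative-alpha schedule
    x = rng.standard_normal(dim)                  # start from pure noise
    for i in range(steps):
        a_t, a_prev = alphas[i], alphas[i + 1]
        eps = denoiser(x, i, cond)
        # Predict the clean latent, then step to the next noise level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x

latent = ddim_sample(cond=np.ones(8))  # 4 network calls instead of hundreds
```

Step-size distillation, as described in the abstract, trains a student to match this trajectory so that four such steps suffice with only a minor quality loss.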
👥 Authors
Yang Yang (Google LLC)
Yunpeng Li (Google LLC)
George Sung (Google Research)
Shao-Fu Shih (Google LLC)
Craig Dooley (Google LLC)
Alessio Centazzo (Google LLC)
Ramanan Rajeswaran (Google LLC)

Fields: Computer Vision, Machine Learning