DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address redundancy between semantic and acoustic tokens and the resulting inefficiency of token-based speech generation in non-streaming scenarios, this paper proposes DiffSoundStream. First, it conditions a neural codec on semantic tokens, so the acoustic tokens need not re-encode information the semantic tokens already carry. Second, it builds a semantic-conditioned latent diffusion model that reconstructs high-fidelity waveforms from semantic and coarse-level acoustic tokens in few sampling steps. The framework integrates self-supervised feature quantization, neural speech coding, semantic conditional modeling, and latent diffusion. Through step-size distillation, sampling is reduced to just four diffusion steps with only minor quality loss. Experiments show that at a low rate of 50 tokens per second, DiffSoundStream matches the synthesis quality of a standard SoundStream model operating at twice the token rate, substantially improving the efficiency-quality trade-off of non-streaming speech synthesis.

📝 Abstract
Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to as semantic tokens and acoustic tokens. These tokens are often modeled autoregressively, with the inference speed being constrained by the token rate. In this work, we propose DiffSoundStream, a solution that improves the efficiency of speech tokenization in non-streaming scenarios through two techniques: (1) conditioning the neural codec on semantic tokens to minimize redundancy between semantic and acoustic tokens, and (2) leveraging latent diffusion models to synthesize high-quality waveforms from semantic and coarse-level acoustic tokens. Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard SoundStream model operating at twice the token rate. Additionally, we achieve step-size distillation using just four diffusion sampling steps with only a minor quality loss.
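The 50 vs. 100 tokens-per-second comparison comes down to simple rate arithmetic: an autoregressive model's inference cost scales with the number of tokens it must predict per second of audio. A minimal sketch, using hypothetical frame rates and codebook counts (the paper's exact configuration is not stated here):

```python
def token_rate(frames_per_sec: int, num_codebooks: int) -> int:
    """Tokens/s for a codec emitting one token per codebook per frame."""
    return frames_per_sec * num_codebooks

# Hypothetical configurations: halving the residual-VQ depth at a fixed
# frame rate halves the token rate the language model must predict.
baseline = token_rate(25, 4)  # e.g., a standard SoundStream-style setup -> 100 tokens/s
reduced = token_rate(25, 2)   # semantic-conditioned codec, fewer levels  -> 50 tokens/s
```

Conditioning the codec on semantic tokens is what makes the lower-rate configuration viable: the discarded fine-level acoustic tokens are recovered by the diffusion decoder rather than transmitted.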
Problem

Research questions and friction points this paper is trying to address.

Improve speech tokenization efficiency in non-streaming scenarios
Reduce redundancy between semantic and acoustic tokens
Synthesize high-quality waveforms with fewer diffusion steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditioning neural codec on semantic tokens
Using latent diffusion models for waveform synthesis
Achieving high quality with fewer diffusion steps
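The few-step sampling idea can be illustrated with a toy deterministic (DDIM-style) sampler. This is a generic sketch of the technique only, with a stand-in linear "denoiser" in place of the paper's semantic-conditioned latent diffusion network; the schedule values and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, step, cond):
    # Stand-in for the latent diffusion model's noise predictor.
    # A real model would be a neural network conditioned on semantic
    # and coarse acoustic tokens (here: a fixed linear map).
    return 0.1 * x + 0.01 * cond

def ddim_sample(cond, steps=4, dim=8):
    """Deterministic DDIM-style sampling over a coarse few-step schedule."""
    alphas = np.linspace(0.999, 0.01, steps + 1)  # cumulative-alpha schedule
    x = rng.standard_normal(dim)                  # start from pure noise
    for i in range(steps):
        a_t, a_prev = alphas[i], alphas[i + 1]
        eps = denoiser(x, i, cond)
        # Predict the clean latent, then step to the next noise level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x

latent = ddim_sample(cond=np.ones(8))  # 4 network calls instead of hundreds
```

Step-size distillation, as described in the abstract, trains a student to match this trajectory so that four such steps suffice with only a minor quality loss.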
👥 Authors
Yang Yang (Google LLC)
Yunpeng Li (Google LLC)
George Sung (Google Research)
Shao-Fu Shih (Google LLC)
Craig Dooley (Google LLC)
Alessio Centazzo (Google LLC)
Ramanan Rajeswaran (Google LLC)

Fields: Computer Vision, Machine Learning