🤖 AI Summary
Existing speech tokenizers struggle to simultaneously achieve strong semantic understanding and high-fidelity acoustic reconstruction, particularly under low bitrates and low token rates. This work proposes SiTok, a speech diffusion tokenizer that, for the first time, integrates a diffusion autoencoder into speech tokenization. By unifying supervised semantic learning with a diffusion-based reconstruction mechanism within a single framework, SiTok effectively balances semantic richness and audio fidelity. Trained on 2 million hours of speech data with a 1.6B-parameter model, SiTok operates at an extremely low token rate of 12.5 Hz and a bitrate of 200 bps, significantly outperforming existing methods across speech understanding, reconstruction, and generation tasks.
📝 Abstract
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bitrates and low token rates. We propose the Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bitrate of 200 bits per second.