Scaling Speech Tokenizers with Diffusion Autoencoders

📅 2026-02-06
🤖 AI Summary
Existing speech tokenizers struggle to achieve both strong semantic understanding and high-fidelity acoustic reconstruction, particularly at low bit rates and low token rates. This work proposes the Speech Diffusion Tokenizer (SiTok), which, for the first time, integrates a diffusion autoencoder into speech tokenization. By combining supervised semantic learning with diffusion-based reconstruction in a single framework, SiTok balances semantic richness and audio fidelity. The 1.6B-parameter model is trained on 2 million hours of speech and operates at a token rate of 12.5 Hz and a bit rate of 200 bps, outperforming strong baselines on speech understanding, reconstruction, and generation tasks.

📝 Abstract
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
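The two rates quoted in the abstract can be related with simple arithmetic: dividing the bit rate by the token rate gives the average number of bits carried per token position. The sketch below works this out in Python. The function names are illustrative, and the "equivalent codebook size" is only a hypothetical reading: the abstract does not say how the bits at each position are divided among codebooks.

```python
# Figures taken from the abstract.
BITRATE_BPS = 200.0      # bits per second
TOKEN_RATE_HZ = 12.5     # token positions (frames) per second


def bits_per_token(bitrate_bps: float, token_rate_hz: float) -> float:
    """Average number of bits carried by one token position."""
    return bitrate_bps / token_rate_hz


def tokens_for_duration(seconds: float, token_rate_hz: float) -> float:
    """Number of token positions needed to cover `seconds` of audio."""
    return seconds * token_rate_hz


bits = bits_per_token(BITRATE_BPS, TOKEN_RATE_HZ)

# Hypothetical reading: if all bits at a position came from a single
# codebook, that codebook would need 2**bits entries. The abstract does
# not state the actual codebook layout.
equivalent_codebook_entries = 2 ** int(bits)

print(bits)                                      # 16.0
print(equivalent_codebook_entries)               # 65536
print(tokens_for_duration(10.0, TOKEN_RATE_HZ))  # 125.0
```

Under these figures, a 10-second utterance is covered by 125 token positions of 16 bits each, which is 2,000 bits in total.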
Problem

Research questions and friction points this paper is trying to address.

speech tokenizers
semantic-acoustic trade-off
low bit rate
low token rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion autoencoder
speech tokenizer
low token rate
semantic-acoustic representation
scalable speech modeling