🤖 AI Summary
Existing speech tokenizers struggle to simultaneously achieve strong semantic understanding and high-fidelity acoustic reconstruction, particularly under low bitrates and low token rates. This work proposes SiTok, a speech diffusion tokenizer that, for the first time, integrates a diffusion autoencoder into speech tokenization. By unifying supervised semantic learning with a diffusion-based reconstruction mechanism within a single framework, SiTok effectively balances semantic richness and audio fidelity. Trained on 2 million hours of speech data with a 1.6B-parameter model, SiTok operates at an extremely low token rate of 12.5 Hz and a bitrate of 200 bps, significantly outperforming existing methods across speech understanding, reconstruction, and generation tasks.
📝 Abstract
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bitrates and low token rates. We propose the Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bitrate of 200 bits per second.