Recent Advances in Discrete Speech Tokens: A Review

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses core challenges in adopting discrete speech tokens as fundamental speech representations for large language models (LLMs). It systematically surveys the two dominant paradigms, acoustic tokens and semantic tokens, and synthesizes the existing taxonomy. Drawing on representative models including VQ-VAE, HuBERT, WavLM, and SpeechTokenizer, the study conducts cross-paradigm comparisons grounded in information-theoretic analysis, comprehensive multi-task evaluation, and ablation studies. Results reveal that semantic tokens facilitate more effective joint modeling with LLMs, whereas acoustic tokens achieve superior waveform fidelity; yet both exhibit fundamental limitations in robustness, generalization, and text–speech alignment. Finally, the paper identifies persistent challenges and proposes potential research directions, offering practical guidance for unified speech–language modeling.

📝 Abstract
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
Problem

Research questions and friction points this paper is trying to address.

Review advances in discrete speech tokens
Analyze strengths and limitations of token types
Propose future research directions for speech tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete speech tokens integration
Acoustic and semantic tokens classification
Systematic experimental token comparisons
Yiwei Guo
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Anomaly Detection, AIOps
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Bohan Li
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Chongtian Shao
Shanghai Jiao Tong University
natural language processing, speech processing, computational linguistics
Hanglei Zhang
Shanghai Jiao Tong University
Chenpeng Du
ByteDance
Speech Interaction
Xie Chen
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Shujie Liu
Microsoft Research Asia (MSRA), Beijing 100080, China
Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China