Scaling Spoken Language Models with Syllabic Speech Tokenization

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the inefficiency of high-frame-rate speech tokenization, which yields excessively long token sequences and high computational overhead. We propose syllable-level speech tokenization to improve the efficiency of spoken language models. Our method employs a self-supervised syllable discretization encoder to map raw audio into compact syllable sequences, which are then fed into a Transformer architecture for spoken language understanding tasks. To our knowledge, this is the first systematic investigation validating the efficacy of syllable-level tokens for spoken language modeling, achieving both high compression (substantially shorter sequences) and strong interpretability. Experiments across multiple training-data scales show that our model matches or surpasses baseline performance while reducing training time by more than 2× and inference FLOPs by 5×.

📝 Abstract
Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with self-attention is expensive, since attention scales quadratically with sequence length. Recent SSL work introduces acoustic tokenization of speech at the syllable level (4-5 Hz), which is more interpretable and potentially more scalable thanks to significant compression in token length. Yet its value for spoken language modeling is not fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame-rate tokens while significantly cutting training and inference costs, achieving more than a 2× reduction in training time and a 5× reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.
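The quadratic-attention argument in the abstract can be checked with a back-of-envelope calculation. A minimal sketch, assuming a 50 Hz baseline token rate (typical of SSL speech models; the paper itself only specifies the ~4-5 Hz syllabic rate) and a hypothetical model width:

```python
# Back-of-envelope comparison of per-layer self-attention cost for
# high-frame-rate vs. syllable-level tokenization of the same utterance.
# The 50 Hz baseline rate and d_model=768 are illustrative assumptions,
# not values reported in the paper.

def attention_flops(seq_len: int, d_model: int = 768) -> int:
    # Dominant terms of one self-attention layer: the QK^T score matrix
    # and the attention-weighted value sum, each ~ seq_len^2 * d_model.
    return 2 * seq_len**2 * d_model

duration_s = 30  # a 30-second utterance
for name, rate_hz in [("high-frame-rate (50 Hz)", 50), ("syllabic (~5 Hz)", 5)]:
    n = duration_s * rate_hz
    print(f"{name}: {n} tokens, ~{attention_flops(n):.2e} attention FLOPs")
```

Because the cost is quadratic in sequence length, a 10× drop in token rate yields a 100× drop in per-layer attention FLOPs; the paper's more modest 5× end-to-end figure reflects that feed-forward layers and other components scale only linearly.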
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in spoken language models with syllable tokenization
Exploring syllable-level speech discretization for efficient language modeling
Achieving performance parity while cutting training and inference expenses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Syllabic tokenization replaces high-frame-rate speech tokens
Reduces training time by over 2× and inference FLOPs by 5×
Enables efficient long-context spoken language modeling