🤖 AI Summary
This study addresses the inefficiency of high-frame-rate speech tokenization, which yields excessively long sequences and high computational overhead. We propose syllable-level speech tokenization to improve the efficiency of spoken language models. Our method employs a self-supervised syllable discretization encoder to map raw audio into compact syllable-token sequences, which are then fed into a Transformer for spoken language understanding tasks. To our knowledge, this is the first systematic investigation validating the efficacy of syllable-level tokens for spoken language modeling, achieving both high compression (substantially shorter sequences) and strong interpretability. Experiments across multiple training-data scales show that our model matches or surpasses baseline performance while cutting training time by more than 2× and inference FLOPs by 5×.
📝 Abstract
Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from self-supervised learning (SSL) speech models. Since the most successful LMs are built on the Transformer architecture, processing these long token streams with self-attention is expensive: attention cost scales quadratically with sequence length. Recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable, offering significant compression in sequence length at token rates of 4-5 Hz. Yet the value of such tokens for spoken language modeling remains largely unexplored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens match or surpass previous high-frame-rate tokens while significantly cutting training and inference costs, achieving more than a 2× reduction in training time and a 5× reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.
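The quadratic-attention argument behind the efficiency claim can be sketched numerically. The frame rates below are illustrative assumptions, not figures from the paper (the abstract only specifies 4-5 Hz for syllable tokens; 25 Hz is a common rate for SSL speech units):

```python
# Sketch: how token rate drives per-layer self-attention cost.
# All numbers are illustrative assumptions for a back-of-envelope estimate.

def attention_flops(seq_len: int, d_model: int = 768) -> int:
    """Approximate multiply-adds for one self-attention layer:
    the QK^T and attention-times-V products each cost ~seq_len^2 * d_model."""
    return 2 * seq_len ** 2 * d_model

duration_s = 10            # a 10-second utterance
high_rate_hz = 25          # assumed high-frame-rate SSL tokenizer
syllable_rate_hz = 5       # syllable-level tokenizer (4-5 Hz per the abstract)

n_high = duration_s * high_rate_hz       # 250 tokens
n_syll = duration_s * syllable_rate_hz   # 50 tokens

ratio = attention_flops(n_high) / attention_flops(n_syll)
print(ratio)  # 25.0 -- a 5x shorter sequence makes attention 25x cheaper
```

The attention-only ratio (25×) exceeds the paper's reported end-to-end 5× FLOPs reduction, which is expected: feed-forward layers and embeddings scale only linearly with sequence length, so they dilute the quadratic savings.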