🤖 AI Summary
Existing discrete speech representations typically tokenize 16 kHz waveforms at fixed rates (25/50 tokens/s), yielding fine-grained, redundant sequences that incur high computational cost and carry excess low-level phonetic detail. To address this, we propose an entropy-driven dynamic aggregation framework that adaptively segments speech by predicting local information entropy, thereby learning coarse-grained, semantically dense compressed representations. Our method breaks free from fixed-bitrate constraints and jointly models intra-segment semantics via cross-attention. We pretrain a speech-language model on large-scale unlabeled data using next-token prediction as a proxy task. Evaluated on ASR, speech translation, and voice conversion, the compressed representations achieve 2–4× sequence length reduction while matching or surpassing the performance of the original dense token sequences, demonstrating strong semantic fidelity and cross-task generalization.
📝 Abstract
Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks show that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
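The boundary-detection step described above can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's implementation: it computes Shannon entropy from a language model's next-token distributions, opens a new segment wherever entropy exceeds a threshold, and mean-pools token embeddings within each segment as a stand-in for the paper's cross-attention fusion module. All function names and the exact thresholding rule are illustrative assumptions.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each next-token distribution; probs has shape (T, V)."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def entropy_segments(entropy, threshold):
    """Open a new segment wherever predictive entropy exceeds the threshold.

    High entropy means the LM is uncertain about the next token, which the
    framework treats as the start of a new semantic unit. Returns a list of
    (start, end) half-open index pairs covering the whole sequence, so a
    lower threshold yields more boundaries and a lower compression ratio.
    """
    boundaries = [0]
    for t in range(1, len(entropy)):
        if entropy[t] > threshold:
            boundaries.append(t)
    boundaries.append(len(entropy))
    return list(zip(boundaries[:-1], boundaries[1:]))

def pool_segments(embeddings, segments):
    """Fuse token embeddings within each segment by mean pooling
    (a simple stand-in for the paper's cross-attention module)."""
    return np.stack([embeddings[s:e].mean(axis=0) for s, e in segments])

# Toy usage: random "LM" distributions over a sequence of 10 tokens.
T, V, D = 10, 8, 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, V))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

H = predictive_entropy(probs)
segments = entropy_segments(H, threshold=np.median(H))
compressed = pool_segments(rng.normal(size=(T, D)), segments)
# compressed now has one vector per segment, i.e. len(segments) <= T rows.
```

In this toy example the threshold is set to the median entropy; in practice it is the tunable knob that controls the granularity and compression ratio of the resulting representations.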