LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

📅 2025-06-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current speech tokenization methods produce excessively long, semantically redundant token sequences that hinder efficient cross-modal modeling in speech-language models, while naive frame-rate reduction often degrades semantic structure and alignment quality. To address this, we propose an indirect semantic distillation framework, supervised by a frozen ASR encoder, that abandons conventional frame pooling: it jointly optimizes an improved encoder-decoder architecture with a semantic quantizer, guided by the frozen ASR encoder and constrained by a waveform reconstruction loss. This yields compact, language-model-friendly discrete speech tokens at multiple frame rates (6.25–25 Hz). Experiments demonstrate significantly higher speech reconstruction fidelity than baselines, and speech-language models trained on these tokens achieve state-of-the-art performance on TTS tasks and competitive results on STT tasks.
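To make the training signal concrete, here is a minimal PyTorch sketch of the indirect distillation objective described above. The module names (`encoder`, `quantizer`, `decoder`, `asr_encoder`) and the specific loss forms (MSE on features, L1 on waveforms) are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch of the indirect semantic distillation step (PyTorch).
# `encoder`, `quantizer`, `decoder`, and `asr_encoder` are hypothetical
# placeholders; the MSE/L1 loss choices are assumptions, not the paper's
# exact objective.
import torch
import torch.nn.functional as F

def lm_spt_style_step(wav, encoder, quantizer, decoder, asr_encoder, lambda_recon=1.0):
    # 1) Encode audio and quantize into discrete semantic tokens.
    z = encoder(wav)                     # (B, T', D) continuous latents
    z_q, codes, vq_loss = quantizer(z)   # discrete tokens + quantizer loss

    # 2) Reconstruct the waveform solely from the semantic tokens.
    wav_hat = decoder(z_q)               # (B, samples)

    # 3) Indirect distillation: compare frozen-ASR-encoder features of the
    #    original and reconstructed audio (teacher weights stay frozen).
    with torch.no_grad():
        feat_ref = asr_encoder(wav)      # target features, no gradient
    feat_hat = asr_encoder(wav_hat)      # gradients flow via wav_hat
    distill_loss = F.mse_loss(feat_hat, feat_ref)

    # 4) Waveform-level reconstruction constraint.
    recon_loss = F.l1_loss(wav_hat, wav)

    return distill_loss + lambda_recon * recon_loss + vq_loss
```

The key design point is that the teacher never sees the student's latent features directly: supervision passes through the reconstructed waveform, which avoids the rigid frame-pooling needed to match teacher and student feature rates.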

📝 Abstract
With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy while capturing content-related latent structure. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25 Hz, 12.5 Hz, and 6.25 Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performance on speech-to-text and consistently outperform baselines on text-to-speech tasks.
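As a quick sanity check on the length-mismatch claim, the sketch below computes token counts for a 10-second utterance at the three supported frame rates. The 50 Hz SSL-teacher baseline and the text-token estimate are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope token counts for a 10-second utterance. The three
# LM-SPT rates come from the abstract; the 50 Hz SSL-style baseline and
# the text-token estimate are illustrative assumptions.
duration_s = 10.0
rates = [
    ("50 Hz SSL-style baseline (assumed)", 50.0),
    ("LM-SPT @ 25 Hz", 25.0),
    ("LM-SPT @ 12.5 Hz", 12.5),
    ("LM-SPT @ 6.25 Hz", 6.25),
]
for name, rate_hz in rates:
    print(f"{name}: {int(duration_s * rate_hz)} tokens")
# Ten seconds of speech transcribes to roughly 25-40 text tokens, so the
# lower frame rates bring speech sequences far closer to text-length parity.
```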
Problem

Research questions and friction points this paper is trying to address.

Reduce the length mismatch between speech token sequences and text
Improve semantic alignment between speech tokens and language models
Enable flexible frame rates without distorting semantic structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic distillation via reconstructed speech discrepancy
Architectural improvements for encoder and decoder
Supports multiple frame rates (25, 12.5, and 6.25 Hz)
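The abstract mentions a semantic quantizer but this page does not detail its design. For orientation, here is a generic VQ-VAE-style vector quantizer (nearest-codebook lookup with a straight-through estimator), the standard pattern such a component typically builds on, not the paper's exact quantizer:

```python
# Generic VQ-VAE-style vector quantizer: a common building block for
# semantic quantizers. This is a standard reference sketch, not LM-SPT's
# actual quantizer design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):
        # z: (batch, frames, dim) continuous encoder outputs.
        flat = z.reshape(-1, z.size(-1))                 # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)  # (B*T, num_codes)
        codes = dists.argmin(dim=-1).view(z.shape[:-1])  # (B, T) token ids
        z_q = self.codebook(codes)                       # (B, T, dim)

        # Codebook and commitment losses (VQ-VAE style).
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: gradients pass from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, codes, vq_loss
```

The `codes` tensor is what an SLM would consume as discrete speech tokens; the straight-through trick lets the encoder train through the non-differentiable argmin.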
Authors

DaeJin Jo (Kakao, Korea University)
Jeeyoung Yun (Korea University)
Sungwoong Kim (Associate Professor, Korea University; artificial general intelligence)