StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current semantic speech tokenizers are highly sensitive to meaning-irrelevant acoustic perturbations, yielding unstable outputs even at high signal-to-noise ratios and thereby significantly increasing the learning burden on downstream SpeechLLMs. To address this, we propose a consensus-driven multi-branch tokenization architecture that employs parallel encoding paths merged by a bit-wise voting mechanism, mitigating the fragility of single-path quantization and alleviating sparse training signals. Combined with targeted optimization strategies, the model produces stable, robust semantic tokens across diverse noise conditions. Experiments demonstrate that the proposed method reduces Unit Edit Distance (UED) to state-of-the-art levels. Furthermore, it markedly improves the noise robustness and generalization of SpeechLLMs on downstream tasks, including ASR and speech understanding, without architectural modifications to the LMs themselves.


📝 Abstract
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses fragility of semantic speech tokenizers to acoustic noise
Improves token stability under diverse noise conditions
Enhances robustness of SpeechLLMs for various tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-branch architecture processes audio in parallel
Bit-wise voting merges representations for stable tokens
Consensus-driven mechanism enhances noise robustness
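The consensus mechanism described above can be sketched as a per-bit majority vote over the integer codes emitted by parallel encoder branches. This is a minimal illustration only, assuming small example values; the function name, branch count, and code width are assumptions, not details from the paper:

```python
import numpy as np

def bitwise_majority_vote(branch_codes: np.ndarray, n_bits: int) -> np.ndarray:
    """Merge parallel branch token codes by per-bit majority vote.

    branch_codes: integer array of shape (n_branches, seq_len), where each
    code is an n_bits-wide binary index into a shared codebook.
    Returns one merged code per frame.
    """
    n_branches, _ = branch_codes.shape
    # Unpack each code into its bits (LSB first): (n_branches, seq_len, n_bits)
    bits = (branch_codes[..., None] >> np.arange(n_bits)) & 1
    # Majority vote per bit position across branches
    voted = (bits.sum(axis=0) * 2 > n_branches).astype(np.int64)
    # Repack the voted bits into a single integer code per frame
    return (voted << np.arange(n_bits)).sum(axis=-1)

# Three hypothetical branches emit 4-bit codes for a 3-frame clip; noise flips
# one bit in two of the branches, but the vote recovers the consensus codes.
codes = np.array([
    [0b1010, 0b0110, 0b1111],
    [0b1011, 0b0110, 0b1111],   # bit flipped in frame 0
    [0b1010, 0b0100, 0b1111],   # bit flipped in frame 1
])
print(bitwise_majority_vote(codes, n_bits=4))  # [10  6 15]
```

An odd number of branches avoids ties, which is one reason a multi-branch (rather than two-path) design suits a voting merge.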
Yuhan Song
State Key Laboratory of Multimedia Information Processing, Peking University
Linhao Zhang
Pattern Recognition Center, WeChat AI, Tencent Inc
Chuhan Wu
WeChat AI, Tencent
Foundation Model · Pretraining · Post Training · LLM Agent
Aiwei Liu
Tsinghua University
Natural Language Processing · Large Language Models · AI Safety · Watermarking
Wei Jia
Pattern Recognition Center, WeChat AI, Tencent Inc
Houfeng Wang
State Key Laboratory of Multimedia Information Processing, Peking University
Xiao Zhou
M.Phil. student at HKUST
Autonomous Driving · DRL