🤖 AI Summary
Discrete speech tokens and continuous speech features exhibit distinct properties in SpeechLLMs, yet their comparative performance in spoken language understanding remains underexplored. Method: This work conducts the first systematic, fair comparison across multiple tasks (ASR, SLU, speech QA) and model scales (Qwen1.5-0.5B, Llama3.1-8B) within a unified experimental framework. Both input types are derived from self-supervised learning (SSL) speech representations, and their characteristics (information encoding patterns, optimal SSL layer selection, LLM layer mapping strategies, and noise robustness) are rigorously analyzed. Contribution/Results: Continuous features consistently outperform discrete tokens across most tasks, showing superior robustness to acoustic perturbations and stronger cross-model generalization. This study establishes the first empirical benchmark for speech representation design in SpeechLLMs and provides evidence-based insights for architectural choices in spoken language modeling.
📝 Abstract
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding tasks using both small- and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including an efficiency comparison, SSL layer analysis, LLM layer analysis, and a robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens across tasks, and that each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.