🤖 AI Summary
Discrete speech tokens and continuous speech features exhibit distinct properties in SpeechLLMs, yet their comparative performance in spoken language understanding remains underexplored. Method: This work conducts the first systematic, fair comparison across multiple tasks (ASR, SLU, speech QA) and model scales (Qwen1.5-0.5B, Llama3.1-8B) within a unified experimental framework. Both input types are derived from self-supervised learning (SSL) speech representations, and their characteristics (information encoding patterns, optimal SSL layer selection, LLM layer mapping strategies, and noise robustness) are rigorously analyzed. Contribution/Results: Continuous features consistently outperform discrete tokens across most tasks, showing superior robustness to acoustic perturbations and stronger cross-model generalization. This study establishes the first empirical benchmark for speech representation design in SpeechLLMs and provides evidence-based insights for architectural choices in spoken language modeling.
📝 Abstract
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding tasks using both small- and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including an efficiency comparison, SSL layer analysis, LLM layer analysis, and a robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens across tasks, and that each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.