Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the robustness of SLAM-ASR in realistic, complex speech scenarios, focusing on cross-domain transfer, speaking-rate variation, and noise corruption. Through multi-dimensional ablation studies and cross-domain generalization tests, it characterizes how performance degrades: the model shows low sensitivity to accent discrepancies but deteriorates significantly under speaking-rate shifts and additive noise, with a WER increase of over 30% out-of-domain. The study examines the SLAM-ASR design, in which a linear adapter connects a speech encoder to an LLM, and establishes a configuration principle that balances data characteristics against computational constraints. Experiments indicate that this principle improves cross-domain WER stability by up to 22%. The work provides an interpretable diagnostic framework and a practically deployable optimization path for LLM-based ASR systems.

📝 Abstract
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can significantly degrade performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
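The robustness evaluation described in the abstract rests on two ingredients: a word error rate (WER) metric and controlled perturbations such as additive noise. The following is a minimal Python sketch of both, not the paper's evaluation code. The function names `wer` and `add_noise_at_snr` are hypothetical, and the noise sources, SNR levels, and text normalization used in the paper are not specified here.

```python
import math


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal`, scaled to the requested SNR in dB."""
    if len(signal) != len(noise) or not signal:
        raise ValueError("signal and noise must be non-empty and equal length")
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # Choose scale so that p_signal / (scale^2 * p_noise) == 10^(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(signal, noise)]
```

In an evaluation of this kind, one would transcribe both the clean and the perturbed audio with the same model and compare the two WER values. Lower `snr_db` values correspond to louder noise relative to the speech.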
Problem

Research questions and friction points this paper is trying to address.

SLAM-ASR Performance
Environmental Conditions
Real-world Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLAM-ASR Method
Environmental Robustness
Speech Recognition Adaptability
Shashi Kumar
Idiap Research Institute, Switzerland; EPFL, Switzerland
Iuliia Thorbecke
Idiap Research Institute, Switzerland; University of Zurich, Switzerland
Sergio Burdisso
Researcher, Idiap Research Institute
artificial intelligence, machine learning, natural language processing
Esaú Villatoro-Tello
Idiap Research Institute, Switzerland
E. ManjunathK
Uniphore, U.S.A. & India
Kadri Hacioglu
Uniphore, U.S.A. & India
Pradeep Rangappa
Senior Speech Applied Scientist (Remote) @Omilia | Postdoc Idiap | Ex- Swiggy | PhD IIT Kharagpur
Speech Recognition, Machine Learning, Speaker Diarization
Petr Motlicek
Idiap Research Institute
Artificial intelligence, speech and signal processing, machine learning
A. Ganapathiraju
Uniphore, U.S.A. & India
Andreas Stolcke
Distinguished AI Scientist, Uniphore
Speech Processing