Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

📅 2026-01-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the instability and high sensitivity to prompt selection in existing large language model (LLM)-based speech recognition approaches that rely on fixed, manually crafted prompts. To overcome this limitation, the authors propose a model-agnostic, learnable prompt projection module that adaptively maps prompt embeddings into more effective regions of the LLM’s input space, without modifying the underlying LLM architecture. This approach significantly reduces prompt sensitivity and enhances recognition robustness and consistency. Experimental results across four benchmark datasets demonstrate that the proposed method not only consistently outperforms the best handcrafted prompts but also substantially mitigates performance variance, yielding more reliable and stable recognition outcomes.
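
One way to make "prompt sensitivity" concrete is to score the same utterances under several prompts and look at the spread of the scores. The sketch below assumes word error rate (WER) as the metric, which the summary does not name explicitly; `word_error_rate` and `prompt_sensitivity` are hypothetical helper names, and the example values in the usage note are toy inputs, not results from the paper.

```python
from statistics import mean, pstdev
from typing import Dict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def prompt_sensitivity(wer_by_prompt: Dict[str, float]) -> Dict[str, float]:
    """Summarize how much the score moves when only the prompt changes."""
    values = list(wer_by_prompt.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "range": max(values) - min(values),
    }
```

For example, `prompt_sensitivity({"prompt_a": 0.1, "prompt_b": 0.3})` reports the mean score together with a standard deviation and a best-to-worst range; a smaller standard deviation and range across prompts would indicate a less prompt-sensitive system.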

📝 Abstract
LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.
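
The abstract does not specify the internals of the prompt projector, so the following is only a minimal, dependency-free sketch of the general idea: a small learnable map is applied to each prompt-token embedding before the sequence reaches a frozen LLM. The residual linear form, the zero initialization (so the module starts as an identity map), the prompt-then-speech concatenation order, and the names `PromptProjector` and `build_llm_input` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


class PromptProjector:
    """Hypothetical residual linear map over prompt-token embeddings."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        # Assumed zero initialization: before any training the projector
        # returns its input unchanged, so the base model's behavior is kept.
        self.weight = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim

    def project(self, embedding: Vector) -> Vector:
        if len(embedding) != self.dim:
            raise ValueError("embedding size does not match projector size")
        linear = [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(self.weight, self.bias)
        ]
        # Residual connection: output = input + learned offset.
        return [x + y for x, y in zip(embedding, linear)]

    def __call__(self, prompt_embeddings: List[Vector]) -> List[Vector]:
        return [self.project(e) for e in prompt_embeddings]


def build_llm_input(
    prompt_embeddings: List[Vector],
    speech_embeddings: List[Vector],
    projector: PromptProjector,
) -> List[Vector]:
    """Project only the prompt tokens; speech embeddings pass through as-is."""
    return projector(prompt_embeddings) + speech_embeddings
```

In a real system the projector's `weight` and `bias` would be the only trainable parameters, updated with a deep learning framework while the LLM-based ASR model stays unchanged, which matches the abstract's description of an extension that does not modify the underlying model.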

Problem

Research questions and friction points this paper is trying to address.

prompt sensitivity
LLM-based ASR
automatic speech recognition
prompt design
performance instability

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt projector
learnable projection
prompt sensitivity
LLM-based ASR
speech-to-LLM alignment
Sergio Gastón Burdisso
Idiap Research Institute
Esaú Villatoro-Tello
Idiap Research Institute
Shashi Kumar
PhD student@Idiap Research Institute, Switzerland | EPFL, Switzerland
Automatic Speech Recognition, Multitask learning, Streaming ASR, LLM-ASR
S. Madikeri
University of Zurich
Andrés Carofilis
Idiap Research Institute
Pradeep Rangappa
Senior Speech Applied Scientist (Remote) @Omilia | Postdoc Idiap | Ex- Swiggy | PhD IIT Kharagpur
Speech Recognition, Machine Learning, Speaker Diarization
E. ManjunathK
Uniphore
Kadri Hacioglu
Uniphore
Petr Motlícek
Idiap Research Institute, Brno University of Technology
A. Stolcke
Uniphore