🤖 AI Summary
This work proposes Index-ASR, a novel approach to large language model (LLM)-based automatic speech recognition (ASR) that addresses the challenges of hallucination errors and limited contextual customization. By integrating an anti-hallucination mechanism with fine-grained hotword guidance for the first time in LLM-driven ASR, and leveraging large-scale training data augmented with background noise and contextual enhancements, Index-ASR substantially improves recognition robustness and adaptability across diverse scenarios. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on both public benchmarks and internal test sets, confirming its effectiveness and practical utility in real-world applications.
📝 Abstract
Automatic speech recognition (ASR) has witnessed remarkable progress in recent years, largely driven by the emergence of the LLM-based ASR paradigm. Despite their strong performance on a variety of open-source benchmarks, existing LLM-based ASR systems still suffer from two critical limitations. First, they are prone to hallucination errors, often generating excessively long and repetitive outputs that are not well grounded in the acoustic input. Second, they provide limited support for flexible and fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to simultaneously enhance robustness and support customizable hotword recognition. The core idea of Index-ASR lies in integrating an LLM with large-scale training data enriched with background noise and contextual information. Experimental results show that Index-ASR achieves strong performance on both open-source benchmarks and in-house test sets, highlighting its robustness and practicality for real-world ASR applications.
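The abstract does not specify how hotword customization is exposed, but a common pattern in LLM-based ASR is to inject a user-supplied hotword list into the decoding prompt so the language model is biased toward those spellings. Below is a minimal sketch of that general idea; the function name and prompt wording are illustrative assumptions, not Index-ASR's actual interface:

```python
def build_hotword_prompt(hotwords, instruction="Transcribe the audio."):
    """Construct a decoding prompt that biases an LLM-based ASR decoder
    toward user-supplied contextual hotwords (illustrative sketch only;
    not the prompt format used by Index-ASR)."""
    if not hotwords:
        # No contextual phrases: fall back to the plain instruction.
        return instruction
    hotword_list = ", ".join(hotwords)
    return (
        f"{instruction} "
        f"The audio may contain the following terms: {hotword_list}. "
        f"Prefer these spellings when they match the speech."
    )

# Example: bias recognition toward domain-specific proper nouns.
prompt = build_hotword_prompt(["Index-ASR", "wav2vec"])
print(prompt)
```

In practice, such a prompt would be prepended to the speech-conditioned context before autoregressive decoding, giving per-request control over vocabulary without retraining the model.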