🤖 AI Summary
Existing automatic speech recognition (ASR) evaluation has largely been confined to contextless settings, reflecting conventional ASR models' limited context modeling and their lack of memory and reasoning grounded in world knowledge. Method: The authors propose ContextASR-Bench, a large-scale contextual ASR benchmark with up to 40,000 speech-text entries spanning more than 10 domains. Each entry can be evaluated with no context, coarse-grained context, or fine-grained context, and, beyond standard transcription metrics, the benchmark measures how accurately models recognize named entities mentioned in the audio. Contribution/Results: Extensive evaluation shows that Large Audio Language Models (LALMs), which inherit strong world knowledge and in-context learning ability from their underlying LLMs, outperform conventional ASR models by a large margin. This work moves ASR evaluation from isolated transcription toward context-aware speech understanding.
📝 Abstract
Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in Large Language Models (LLMs) and the corresponding Large Audio Language Models (LALMs) have markedly advanced general artificial intelligence capabilities. Consequently, there is a compelling need for a benchmark that can evaluate both the generality and the intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. The benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and in-context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench.
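To illustrate the entity-level evaluation the abstract describes, here is a minimal sketch of how one might score whether annotated named entities survive transcription. This is a hypothetical exact-match recall metric, not the benchmark's official implementation; the function name and matching rule are assumptions.

```python
# Hypothetical sketch: entity-level scoring of an ASR hypothesis, in the
# spirit of ContextASR-Bench's entity-recognition evaluation. The actual
# benchmark metric may normalize text or align tokens differently.

def entity_recall(hypothesis: str, entities: list[str]) -> float:
    """Fraction of annotated entities that appear verbatim (case-insensitive)
    in the ASR hypothesis transcript."""
    if not entities:
        return 1.0  # no entities annotated: trivially perfect
    hyp = hypothesis.lower()
    hits = sum(1 for e in entities if e.lower() in hyp)
    return hits / len(entities)

# Example: check whether domain-specific entities mentioned in the audio
# were transcribed correctly.
hyp = "the patient was prescribed metformin for type two diabetes"
ents = ["Metformin", "type two diabetes"]
print(entity_recall(hyp, ents))  # 1.0 (both entities recovered)
```

A contextual ASR system that conditions on, say, a medical-domain entity list would be expected to score higher on such a metric than a contextless one, which is the gap the benchmark is built to expose.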