🤖 AI Summary
This work proposes Agentic ASR, an interactive framework that reframes automatic speech recognition as a multi-turn, closed-loop process. It targets two limitations of conventional single-pass systems: they struggle to correct semantically critical errors once they occur, and they are evaluated with token-level metrics such as WER and CER that poorly reflect semantic fidelity. The framework combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing to emulate the iterative clarification found in human dialogue. The work also introduces the Sentence-level Semantic Error Rate (S²ER), an LLM-based evaluation metric, together with an interactive simulation system for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human--AI alignment and ablation studies further support the reliability of the semantic judge and the robustness of the framework.
📝 Abstract
Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER and CER do not adequately capture such errors. To address these limitations, we formulate \emph{Interactive ASR} as a multi-turn refinement task and propose \textbf{Agentic ASR}, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the \textbf{Sentence-level Semantic Error Rate} ($S^2ER$), an LLM-based semantic evaluation metric, together with an \textbf{Interactive Simulation System} for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and a live demo is available at https://i-asr.sjtuxlance.com/
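
To make the two central ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a sentence-level semantic error rate can be read as the fraction of sentences whose transcript a judge marks as not meaning-equivalent to the reference, and that multi-turn refinement repeats a correction step until the judge accepts or a turn budget runs out. The names `semantic_error_rate`, `interactive_refine`, `toy_judge`, and `toy_correct` are hypothetical; in particular, `toy_judge` is a string-matching stand-in for the LLM-based judge, and `toy_correct` is an oracle stand-in for the simulated user plus editing step.

```python
from typing import Callable, Sequence, Tuple

Judge = Callable[[str, str], bool]      # (reference, hypothesis) -> meaning preserved?
Corrector = Callable[[str, str], str]   # (reference, hypothesis) -> revised hypothesis


def semantic_error_rate(references: Sequence[str],
                        hypotheses: Sequence[str],
                        judge: Judge) -> float:
    """Fraction of sentences the judge marks as NOT meaning-equivalent (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    errors = sum(0 if judge(r, h) else 1 for r, h in zip(references, hypotheses))
    return errors / len(references)


def interactive_refine(reference: str, hypothesis: str, judge: Judge,
                       correct: Corrector, max_turns: int = 3) -> Tuple[str, int]:
    """Repeat the correction step until the judge accepts or the turn budget is spent."""
    turns = 0
    while turns < max_turns and not judge(reference, hypothesis):
        hypothesis = correct(reference, hypothesis)
        turns += 1
    return hypothesis, turns


def _normalize(text: str) -> list:
    # Lowercase and drop punctuation so surface differences are ignored.
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def toy_judge(reference: str, hypothesis: str) -> bool:
    # Stand-in for an LLM judge: accepts only if the normalized words match.
    return _normalize(reference) == _normalize(hypothesis)


def toy_correct(reference: str, hypothesis: str) -> str:
    # Stand-in for simulated feedback plus editing: fix the first differing word.
    ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    for i, (r, h) in enumerate(zip(ref_tokens, hyp_tokens)):
        if r != h:
            hyp_tokens[i] = r
            return " ".join(hyp_tokens)
    return reference  # Length mismatch only: fall back to the reference.


references = ["Call Dr. Nguyen at nine.", "Book a table for two."]
hypotheses = ["call dr nguyen at nine", "Book a cable for two."]

before = semantic_error_rate(references, hypotheses, toy_judge)
refined = [interactive_refine(r, h, toy_judge, toy_correct)[0]
           for r, h in zip(references, hypotheses)]
after = semantic_error_rate(references, refined, toy_judge)
print(before, after)
```

In this toy run the first hypothesis differs from its reference only in casing and punctuation, so the judge accepts it even though a token-level metric would count errors; the second needs one correction turn. A real judge would compare meaning rather than normalized strings.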