🤖 AI Summary
ASR outputs often contain errors that degrade downstream task performance. To address this, we propose LIR-ASR, a human-audition-inspired iterative correction framework featuring a novel three-stage “Listen–Imagine–Refine” mechanism. First, the ASR output is parsed (“Listen”). Next, a large language model generates phoneme-level variants, explicitly modeling speech uncertainty (“Imagine”). Finally, context-aware global optimization is performed under linguistic constraints via finite-state-machine-guided heuristic search (“Refine”), avoiding local optima while preserving semantic consistency. Evaluated on multilingual (Chinese and English) and multi-scenario ASR post-processing tasks, LIR-ASR achieves average reductions in character error rate (CER) and word error rate (WER) of up to 1.5 percentage points, significantly enhancing transcription robustness and accuracy.
📝 Abstract
Automatic Speech Recognition (ASR) systems remain prone to errors that affect downstream applications. In this paper, we propose LIR-ASR, a heuristically optimized iterative correction framework based on large language models (LLMs) and inspired by human auditory perception. LIR-ASR applies a "Listening-Imagining-Refining" strategy, generating phonetic variants and refining them in context. A heuristic optimization scheme based on a finite state machine (FSM) is introduced to prevent the correction process from being trapped in local optima, while rule-based constraints help maintain semantic fidelity. Experiments on both English and Chinese ASR outputs show that LIR-ASR achieves average reductions in CER/WER of up to 1.5 percentage points compared to baselines, demonstrating substantial accuracy gains in transcription.
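The Imagine/Refine interplay can be sketched as a toy pipeline. This is only an illustrative sketch, not the paper's implementation: the `CONFUSIONS` table stands in for LLM-generated phoneme-level variants, and a simple bigram count stands in for the FSM-guided, context-aware scoring; all names here (`imagine`, `refine`, `score`) are hypothetical.

```python
from itertools import product

# Hypothetical phonetic-confusion table standing in for the LLM's
# phoneme-level variant generation ("Imagine" stage).
CONFUSIONS = {
    "sea": ["see", "sea"],
    "there": ["their", "there"],
}

def imagine(tokens):
    """Expand each token into its set of phonetic variants."""
    return [CONFUSIONS.get(t, [t]) for t in tokens]

def score(candidate, lm_bigrams):
    """Toy context score: number of candidate bigrams attested in a
    reference set. Stands in for the framework's linguistic constraints."""
    return sum(1 for b in zip(candidate, candidate[1:]) if b in lm_bigrams)

def refine(tokens, lm_bigrams):
    """Global search over all variant combinations, keeping the
    best-scoring hypothesis (the paper instead uses FSM-guided
    heuristic search to avoid enumerating everything)."""
    best = max(product(*imagine(tokens)), key=lambda c: score(c, lm_bigrams))
    return list(best)

# Usage: "sea" is rewritten to "see" because its bigrams fit the context.
lm = {("i", "see"), ("see", "the")}
print(refine(["i", "sea", "the"], lm))  # → ['i', 'see', 'the']
```

Exhaustive `product` enumeration is exponential in sentence length, which is precisely why the paper replaces it with heuristic search guided by an FSM.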