🤖 AI Summary
This work addresses error propagation in cascaded spoken dialogue systems, where automatic speech recognition (ASR) failures carry over to downstream components and degrade interaction quality. Conventional confidence-based filtering mitigates this poorly: it typically misses deletion errors and cannot separate acoustic mismatches (the system could not hear clearly) from linguistic ones (the system could not understand). The authors propose a cause-aware error recovery paradigm in which a suite of small, precision-focused detectors reads deep ASR latent representations and classifies token-level errors as perception, comprehension, or deletion failures. This diagnosis guides a large language model in generating targeted, multi-turn clarification strategies. The approach more than doubles recall on domain-shift errors relative to baselines (57.96% vs. 23.66%) and yields up to a 30% reduction in word error rate and a 17% improvement on the downstream task across diverse accents, distortions, and domains.
📝 Abstract
Cascaded Automatic Speech Recognition–Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in word error rate (WER) and a 17% improvement on the downstream task across diverse accents, distortions, and domains.
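To make the cause-aware idea concrete, the following is a minimal sketch of how a diagnosed error type might be mapped to a clarification instruction for the LLM. It is not the authors' implementation: the names `ErrorCause`, `Diagnosis`, `STRATEGIES`, and `build_clarification_instruction`, the score threshold, and the strategy wording are all assumptions made for illustration, and the detectors themselves are represented only by their outputs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorCause(Enum):
    PERCEPTION = "perception"        # acoustic: the system could not hear clearly
    COMPREHENSION = "comprehension"  # linguistic: heard, but not understood
    DELETION = "deletion"            # words missing from the transcript


@dataclass
class Diagnosis:
    cause: ErrorCause
    token: Optional[str]  # flagged token; None for deletions
    score: float          # hypothetical detector score in [0, 1]


# Hypothetical strategy templates; the abstract does not give the actual prompts.
STRATEGIES = {
    ErrorCause.PERCEPTION: "Ask the user to repeat the part around '{token}'.",
    ErrorCause.COMPREHENSION: "Ask the user to rephrase or spell '{token}'.",
    ErrorCause.DELETION: "Ask the user whether anything is missing from: '{transcript}'.",
}


def build_clarification_instruction(
    transcript: str, diagnoses: List[Diagnosis], threshold: float = 0.5
) -> Optional[str]:
    """Return an instruction for the LLM, or None if no error is flagged."""
    # Keep only diagnoses the (assumed) detectors are sufficiently sure about.
    flagged = [d for d in diagnoses if d.score >= threshold]
    if not flagged:
        return None
    # Address the highest-scoring error first; later turns can handle the rest.
    top = max(flagged, key=lambda d: d.score)
    return STRATEGIES[top.cause].format(token=top.token, transcript=transcript)
```

In contrast to a single confidence threshold that only accepts or rejects a transcript, this routing lets each error type trigger a different follow-up question, which is the distinction the abstract draws between hearing and understanding failures.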