AI Summary
This study examines how automatic speech recognition (ASR) errors propagate through Korean spoken question-answering (SQA) systems built as an ASR-large language model (LLM) cascade, and the downstream semantic failures that result. Through ASR error analysis, semantic failure evaluation, and a comparison with an end-to-end audio language model, the authors show that even a single-character ASR error can leave the gold answer entirely absent from the downstream prediction. They find that the relative degradation caused by ASR errors is consistent across LLMs with different absolute performance, which suggests that cascade degradation largely tracks information lost at the ASR stage. In an auxiliary comparison on noisy Korean SQA, a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone, indicating that direct audio input has the potential to mitigate transcript-induced information loss.
Abstract
We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.
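As an illustration of the single-character failure channel described in the abstract, the following is a minimal Python sketch, not the authors' evaluation code. It assumes one plausible way to operationalize the idea: a case counts as a single-character semantic failure when the reference and ASR transcripts differ by exactly one character-level edit (one Hangul syllable block after NFC normalization) and the gold answer does not appear as a substring of the downstream prediction. The function names, the substring criterion, and the example strings are all hypothetical.

```python
import unicodedata


def char_edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance after NFC normalization.

    NFC composes Hangul jamo into precomposed syllable blocks, so each
    Korean syllable is compared as one character (an assumption of this sketch).
    """
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_single_char_semantic_failure(ref: str, hyp: str,
                                    gold: str, prediction: str) -> bool:
    """Hypothetical criterion: exactly one character edit in the transcript,
    and the gold answer is absent from the downstream prediction."""
    gold_n = unicodedata.normalize("NFC", gold)
    pred_n = unicodedata.normalize("NFC", prediction)
    return char_edit_distance(ref, hyp) == 1 and gold_n not in pred_n


# Made-up example: one syllable differs ("사과" apple vs. "사고" accident).
reference = "사과 가격은 얼마인가요"
asr_output = "사고 가격은 얼마인가요"
print(char_edit_distance(reference, asr_output))
print(is_single_char_semantic_failure(reference, asr_output,
                                      gold="천 원", prediction="잘 모르겠습니다"))
```

A substring check is a deliberately strict, simple proxy for "gold answer entirely absent"; a real evaluation might instead normalize spacing and particles or use a semantic match, which this sketch does not attempt.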