Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Post-hoc miscue detection degrades when the ASR transcription is not a faithful verbatim record of what was read. To address this, the paper proposes the first end-to-end miscue annotation framework. Built on the Whisper model, it supplies the target reading text as a decoding prompt and trains jointly for verbatim transcription and miscue type classification, which removes the reliance on post-hoc alignment. The authors show empirically that text prompting yields better verbatim transcription accuracy than fine-tuning alone. The framework also integrates multi-task learning, reading-conditioned decoding, and domain adaptation for children's and atypical adult speech. Experiments on two realistic oral reading scenarios show a 12.3% reduction in verbatim word error rate (WER) and an 18.7% improvement in miscue detection F1-score, significantly outperforming state-of-the-art methods.
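To make the verbatim WER metric concrete, here is a minimal sketch of word error rate computed as word-level edit distance divided by reference length. The function name and the example sentences are illustrative assumptions, not taken from the paper. The point is that for verbatim transcription the reference is what the reader actually said, so an ASR system that silently "corrects" speech toward the target text is penalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; we roll one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped by the ASR
                curr[j - 1] + 1,     # extra word produced by the ASR
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the reader repeats a word, misreads one, skips one.
spoken = "the the cat sit on mat"             # verbatim reference
asr_corrected = "the cat sat on the mat"      # ASR drifted toward the target text
asr_verbatim = "the the cat sit on mat"       # ASR kept the miscues

if __name__ == "__main__":
    print(word_error_rate(spoken, asr_corrected))
    print(word_error_rate(spoken, asr_verbatim))
```

In this toy case the "corrected" hypothesis needs three edits against a six-word reference, so its verbatim WER is 0.5, while the hypothesis that preserves the miscues scores 0.0.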

📝 Abstract
Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies, children's read-aloud and adult atypical speech, and found that our proposed strategies improve verbatim transcription and miscue detection compared to the current state of the art.
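
The post-hoc baseline described in the abstract can be sketched roughly as follows: align the ASR transcript against the target text and label each difference. This is an illustrative sketch only, using Python's standard `difflib`; the function name `posthoc_miscues` and the three label names are assumptions made for the example and are not the paper's taxonomy or implementation.

```python
from difflib import SequenceMatcher


def posthoc_miscues(target_text: str, asr_transcript: str) -> list[tuple[str, str, str]]:
    """Return (label, target_words, transcribed_words) for each difference."""
    target = target_text.lower().split()
    heard = asr_transcript.lower().split()

    # Hypothetical label names chosen for this sketch.
    labels = {
        "replace": "substitution",
        "delete": "omission",
        "insert": "insertion",
    }

    miscues = []
    matcher = SequenceMatcher(a=target, b=heard, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        miscues.append((labels[tag], " ".join(target[i1:i2]), " ".join(heard[j1:j2])))
    return miscues


if __name__ == "__main__":
    target = "the cat sat on the mat"
    # A faithful verbatim transcript exposes the miscues.
    print(posthoc_miscues(target, "the cat sit on mat"))
    # A transcript that drifted toward the target hides them entirely.
    print(posthoc_miscues(target, "the cat sat on the mat"))
```

The second call illustrates the failure mode the abstract points to: if the ASR output is not verbatim, the alignment has nothing to flag, regardless of what the reader actually said.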

Problem

Research questions and friction points this paper is trying to address.

Improving verbatim transcription accuracy in ASR systems
Enabling end-to-end miscue detection during read-aloud tasks
Enhancing error annotation methods for reading assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompting Whisper for verbatim transcription improvement
End-to-end architecture for direct miscue detection
Augmenting ASR tasks with reading text prompts
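
As a rough illustration of what prompting with the reading text could look like, the sketch below lays out a Whisper-style decoder prefix as a plain string. Whisper's special tokens `<|startofprev|>`, `<|startoftranscript|>`, `<|transcribe|>`, and `<|notimestamps|>` are real, but the helper `build_decoder_prefix`, and the assumption that the target text is placed in the previous-context slot, are hypothetical. This page does not specify how the paper encodes the prompt or emits miscue labels, so neither is shown here.

```python
def build_decoder_prefix(target_text: str, language: str = "en") -> str:
    """Lay out a Whisper-style decoder prefix with the reading text as prompt.

    Assumption for this sketch: the target reading text occupies the
    "previous context" slot that Whisper exposes for prompting.
    """
    prompt = " " + target_text.strip()  # prompt text with a leading space
    return (
        "<|startofprev|>"
        + prompt
        + "<|startoftranscript|>"
        + f"<|{language}|>"
        + "<|transcribe|>"
        + "<|notimestamps|>"
    )


if __name__ == "__main__":
    # The decoder would continue generating the transcript after this prefix.
    print(build_decoder_prefix("The cat sat on the mat."))
```

In a real pipeline the prefix would be built from token IDs by the model's tokenizer, not by string concatenation; the string form is used here only to make the ordering visible.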