Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down

📅 2025-05-19
🤖 AI Summary
Whisper hallucinates on non-speech segments (e.g., silence, noise), which hinders its industrial deployment. To address this, we propose a lightweight *head-level intervention* that requires no pre- or post-processing. Using a head-wise mask, we measure how much each self-attention head in the Whisper-large-v3 decoder contributes to hallucination and find that only three of the 20 heads account for over 75% of the hallucinations on UrbanSound. We then fine-tune these three heads on a collection of non-speech data. The best fine-tuned model, Calm-Whisper, reduces non-speech hallucinations by over 80% while degrading word error rate (WER) by less than 0.1% on LibriSpeech test-clean and test-other. The work intervenes directly at the *attention-head level*, offering an interpretable, low-overhead way to make Whisper more trustworthy on non-speech audio.

📝 Abstract
OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to exhibit hallucination issues, particularly in non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three crazy heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.
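
The head-wise masking benchmark described in the abstract can be sketched roughly as follows. This is only an illustration of the idea, not the authors' code: `rank_heads_by_masking` and `toy_count_hallucinations` are hypothetical names, the evaluator is a stand-in for a real run of Whisper over non-speech clips, and the per-head numbers are synthetic values chosen for the example, not results from the paper.

```python
from typing import Callable, List, Sequence, Tuple

NUM_HEADS = 20  # head count stated in the abstract for the Whisper-large-v3 decoder


def rank_heads_by_masking(
    count_hallucinations: Callable[[Sequence[int]], int],
    num_heads: int = NUM_HEADS,
) -> List[Tuple[int, float]]:
    """Mask one head at a time; return (head, relative reduction), largest first."""
    baseline = count_hallucinations([1] * num_heads)  # all heads active
    ranking = []
    for head in range(num_heads):
        mask = [1] * num_heads
        mask[head] = 0  # silence exactly one head
        remaining = count_hallucinations(mask)
        reduction = (baseline - remaining) / baseline if baseline else 0.0
        ranking.append((head, reduction))
    # sorted() is stable, so tied heads keep ascending index order
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical stand-in for a real evaluation run: synthetic per-head
# contributions, NOT numbers from the paper.
TOY_CONTRIBUTION = {2: 30, 7: 25, 11: 20}
TOY_FLOOR = 25  # hallucinations that remain whichever single head is masked


def toy_count_hallucinations(mask: Sequence[int]) -> int:
    """Count hallucinations for a 0/1 head mask under the toy model above."""
    return TOY_FLOOR + sum(
        count for head, count in TOY_CONTRIBUTION.items() if mask[head]
    )


if __name__ == "__main__":
    for head, reduction in rank_heads_by_masking(toy_count_hallucinations)[:3]:
        print(f"head {head}: {reduction:.0%} fewer hallucinations when masked")
```

In a real setup, the evaluator would apply the mask inside the decoder's self-attention and count non-speech clips that produce any transcription; how the mask is applied across decoder layers is not specified in the abstract and is left open here.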
Problem

Research questions and friction points this paper is trying to address.

Reduce Whisper's hallucination in non-speech segments
Identify key self-attention heads causing hallucinations
Fine-tune only those heads to suppress hallucination while keeping WER degradation minimal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identify hallucination-causing heads via head-wise masking
Fine-tune specific heads using non-speech data
Reduce non-speech hallucinations by over 80% with less than 0.1% WER degradation on LibriSpeech
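
One plausible way to restrict fine-tuning to a few heads is to zero the gradient everywhere except the weight rows that belong to those heads. The sketch below assumes a query/key/value projection stored as a `d_model x d_model` matrix whose output rows are laid out head by head, which is a common convention but is not confirmed by the source; `head_rows` and `mask_gradient` are hypothetical helper names, and the paper may select parameters differently.

```python
from typing import Iterable, List

Matrix = List[List[float]]


def head_rows(head: int, d_model: int, num_heads: int) -> range:
    """Rows of a (d_model x d_model) q/k/v projection that belong to one head."""
    if d_model % num_heads:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    return range(head * head_dim, (head + 1) * head_dim)


def mask_gradient(
    grad: Matrix, trainable_heads: Iterable[int], num_heads: int
) -> Matrix:
    """Return a copy of grad with every row outside the trainable heads zeroed."""
    d_model = len(grad)
    keep = set()
    for head in trainable_heads:
        keep.update(head_rows(head, d_model, num_heads))
    return [
        list(row) if index in keep else [0.0] * len(row)
        for index, row in enumerate(grad)
    ]


if __name__ == "__main__":
    # Toy sizes: 4 heads of dimension 2; only heads 1 and 3 receive updates.
    toy_grad = [[1.0, 1.0, 1.0] for _ in range(8)]
    for row in mask_gradient(toy_grad, trainable_heads=[1, 3], num_heads=4):
        print(row)
```

With 20 heads and a model width of 1280 (an assumption about Whisper-large-v3), each head would own 64 consecutive rows. For the attention output projection, the same indices would select columns rather than rows. Note that an optimizer with decoupled weight decay can still change weights whose gradient is zero, so the frozen rows may need to be excluded from decay as well.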