🤖 AI Summary
Dementia impairs speech fluency, increasing disfluencies, pauses, and utterance fragmentation, which challenges mainstream ASR systems (e.g., Whisper) and degrades transcription accuracy. This work presents the first systematic adaptation of Whisper for dementia speech recognition. We propose a multi-objective supervised fine-tuning framework that jointly optimises word error rate (WER) and filler inclusion rate (FIR), trained on DementiaBank and a newly curated in-house dataset using Whisper-medium. Innovations include fine-grained filler-word annotation and a weighted multi-task loss, which enhance model generalisation. Our approach reduces WER to 0.24, outperforming baseline Whisper and prior methods, while simultaneously improving the F1 score and FIR. Crucially, the model is robust to unseen disfluency patterns. This work establishes a high-accuracy, generalisable speech-to-text foundation for early dementia screening and assistive technologies.
📝 Abstract
Whisper often fails to transcribe dementia speech correctly because persons with dementia (PwDs) exhibit irregular speech patterns and disfluencies such as pauses, repetitions, and fragmented sentences. The model was trained on standard speech and may have had little or no exposure to dementia-affected speech. However, accurate transcription of dementia speech is vital for cost-effective diagnosis and the development of assistive technology. In this work, we fine-tune Whisper on the open-source dementia speech dataset (DementiaBank) and our in-house dataset to improve its word error rate (WER). The fine-tuning also retains filler words in the transcripts so that the filler inclusion rate (FIR) and F1 score can be measured. The fine-tuned models significantly outperformed the off-the-shelf models: the medium-sized model achieved a WER of 0.24, outperforming previous work, and it also generalised notably well to unseen data and speech patterns.
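The summary above mentions a weighted multi-task loss that jointly optimises transcription accuracy and filler-word identification. The paper's actual implementation is not given here, but the general idea can be sketched as a weighted sum of two per-token objectives; the function names, toy probabilities, and weights below are illustrative assumptions, not the authors' code:

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_idx])

def multi_task_loss(asr_probs, asr_target, filler_probs, filler_target,
                    w_asr=0.8, w_filler=0.2):
    """Weighted multi-task loss: transcription loss plus filler-detection loss.

    The weights w_asr and w_filler are hypothetical; in practice they would be
    tuned so that improving filler recognition does not degrade overall WER.
    """
    l_asr = cross_entropy(asr_probs, asr_target)        # word-level ASR objective
    l_filler = cross_entropy(filler_probs, filler_target)  # filler-word objective
    return w_asr * l_asr + w_filler * l_filler

# Toy example: a confident ASR prediction and a less confident filler prediction
loss = multi_task_loss([0.9, 0.1], 0, [0.7, 0.3], 0)
```

In a real fine-tuning loop both terms would be computed from the model's logits over the tokeniser's vocabulary (with filler words kept as explicit tokens in the reference transcripts), and the combined loss back-propagated through Whisper's decoder.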