Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the severe phoneme distortions and high inter-speaker variability of dysarthric speech, which drastically degrade automatic speech recognition (ASR) performance. We propose a novel collaborative decoding paradigm that integrates self-supervised speech models with large language models (LLMs). Specifically, we couple front-end acoustic models (Wav2Vec 2.0, HuBERT, and Whisper), using both CTC and sequence-to-sequence outputs, with back-end constrained decoding by LLMs (BART, GPT-2, and Vicuna) that restores phonemes and enforces syntactic and semantic consistency. Extensive experiments across multiple severity levels and cross-dataset benchmarks (e.g., UA-Speech, TORGO) show that LLM-augmented decoding significantly reduces word error rate (WER) and improves intelligibility, achieving up to a 23.6% relative WER reduction on severely dysarthric speech over baseline ASR systems, while markedly enhancing generalization and robustness. To our knowledge, this is the first systematic investigation to validate the critical role of LLMs in joint semantic-acoustic modeling of dysarthric speech.
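
The back-end step can be pictured as N-best rescoring: the acoustic front end proposes hypotheses with scores, and an LLM re-ranks them by linguistic plausibility. The sketch below is one plausible reading of that idea, not the paper's released code; the `gpt2` checkpoint, the fusion weight `lm_weight`, and the toy hypotheses are all illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_log_prob(text: str) -> float:
    """Total GPT-2 log-probability of a hypothesis (higher = more fluent)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids returns the mean cross-entropy over the
        # ids.size(1) - 1 predicted tokens; negate and un-average
        # it to recover a total log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def rescore(nbest, lm_weight=0.5):
    """nbest: list of (hypothesis, acoustic_log_score) pairs from the front end."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0]))

# Toy example: the LM prefers the syntactically consistent hypothesis even
# though the acoustic model scored the phoneme-distorted one slightly higher.
nbest = [("the weather is could today", -4.2),
         ("the weather is cold today", -4.8)]
best_hyp, _ = rescore(nbest)
```

The paper's constrained decoding may instead steer the beam search token by token; rescoring is simply the lightest-weight fusion point with the same intent.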

📝 Abstract
Dysarthric speech poses major challenges for Automatic Speech Recognition (ASR) due to phoneme distortions and high variability. While self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown promise, their effectiveness on dysarthric speech remains unclear. This study systematically benchmarks these models with different decoding strategies, including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). Our contributions include (1) benchmarking ASR architectures for dysarthric speech, (2) introducing LLM-based decoding to improve intelligibility, (3) analyzing generalization across datasets, and (4) providing insights into recognition errors across severity levels. Findings highlight that LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic constraints for phoneme restoration and grammatical correction.
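
For concreteness, the CTC decoding strategy named above works frame by frame: take the argmax token per frame, collapse repeats, and drop blanks. A minimal sketch with a public Wav2Vec 2.0 checkpoint follows; `facebook/wav2vec2-base-960h` is an assumption standing in for the paper's fine-tuned dysarthric models.

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def transcribe(waveform: np.ndarray, sample_rate: int = 16_000) -> str:
    """waveform: 1-D float32 array of raw audio sampled at 16 kHz."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    pred_ids = torch.argmax(logits, dim=-1)
    # batch_decode collapses repeated tokens and strips CTC blanks
    return processor.batch_decode(pred_ids)[0]
```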
Problem

Research questions and friction points this paper is trying to address.

Evaluating ASR models for dysarthric speech recognition
Exploring LLM-enhanced decoding for intelligibility improvement
Analyzing generalization across datasets and severity levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking self-supervised ASR models
Introducing LLM-based decoding strategies
Leveraging linguistic constraints for phoneme restoration and grammatical correction (see the sketch below)
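
A hedged sketch of the last point, read as seq2seq correction: BART rewrites a noisy 1-best ASR hypothesis into a fluent sentence, supplying the grammatical correction the abstract describes. A deployed corrector would be fine-tuned on (hypothesis, reference) pairs; the vanilla `facebook/bart-base` checkpoint and the example string are illustrative assumptions, not the paper's trained model.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def correct(hypothesis: str) -> str:
    """Beam-search rewrite of an ASR hypothesis under BART's language prior."""
    ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    out = model.generate(ids, num_beams=4, max_length=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# e.g. correct("he go to the store yesterday")  # hypothetical dysarthric output
```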