When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the poor performance of current speech recognition systems on atypical speech, such as dysarthric speech, and asks whether audio-language models can use clinical context supplied at inference time. Using the Speech Accessibility Project (SAP) dataset, the authors build a benchmark that tests whether nine models benefit from diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions. Prompting with this context yields negligible improvements and often degrades word error rate (WER). The authors then apply context-dependent LoRA fine-tuning with a mixture of clinical prompt formats, which reaches a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses show significant gains for speakers with Down syndrome and for mild-severity speakers.

📝 Abstract

Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
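
To make the headline numbers concrete, the following is a minimal sketch of how word error rate and a relative reduction are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, and the tokenization (whitespace splitting, no text normalization) is an assumption. The implied baseline is a back-calculation from the rounded figures in the abstract (0.066 and 52%), not a value reported in the source.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Assumes whitespace tokenization and no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which adapted_wer improves on baseline_wer."""
    return (baseline_wer - adapted_wer) / baseline_wer


def implied_baseline(adapted_wer: float, reduction: float) -> float:
    """Baseline WER implied by an adapted WER and a relative reduction."""
    return adapted_wer / (1.0 - reduction)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # Back-calculated from the rounded figures in the abstract.
    print(implied_baseline(0.066, 0.52))
```

Under these assumptions, a WER of 0.066 after a 52% relative reduction corresponds to a frozen-baseline WER of roughly 0.14. Because both inputs are rounded, the true baseline may differ slightly.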

Problem

Research questions and friction points this paper is trying to address.

dysarthric speech recognition

audio-language models

multimodal context

clinical context

automatic speech recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

dysarthric speech recognition

audio-language models

clinical context