🤖 AI Summary
This work addresses the limited performance of conventional word-naming recognition methods for individuals with post-stroke aphasia, whose disfluent speech and mispronunciations hinder accurate identification. To tackle this issue, the study introduces, for the first time, a Contrastive Language-Audio Pretraining (CLAP) framework tailored to this task. By mapping spoken utterances and textual prompts into a shared embedding space, the approach achieves effective cross-modal alignment that robustly handles atypical pronunciation patterns. Combining multimodal embedding alignment, textual prompt engineering, and deep neural networks, the proposed method attains up to 90% recognition accuracy on two French aphasic patient datasets, substantially outperforming existing baselines based on classification or automatic speech recognition.
📄 Abstract
Conventional automatic word-naming recognition systems struggle to recognize words from post-stroke patients with aphasia because of disfluencies and mispronunciations, limiting reliable automated assessment in this population. In this paper, we propose a Contrastive Language-Audio Pretraining (CLAP) based approach for automatic word-naming recognition to address this challenge by leveraging text-audio alignment. Our approach treats word-naming recognition as an audio-text matching problem, projecting speech signals and textual prompts into a shared embedding space to identify intended words even in challenging recordings. Evaluated on two speech datasets of French post-stroke patients with aphasia, our approach achieves up to 90% accuracy, outperforming existing classification-based and automatic speech recognition-based baselines.
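The core matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a CLAP-style model has already produced L2-comparable embeddings for the patient's utterance and for a text prompt per candidate word, and it replaces the real encoders with toy vectors so the script is self-contained. The vocabulary, prompt embeddings, and noise model here are all hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_word(audio_emb, text_embs, vocabulary):
    """Return the vocabulary word whose prompt embedding lies closest
    to the audio embedding in the shared space (argmax over cosine)."""
    scores = [cosine_similarity(audio_emb, t) for t in text_embs]
    return vocabulary[int(np.argmax(scores))]

# Toy stand-in embeddings; a real system would obtain these from the
# CLAP audio and text encoders (hypothetical vocabulary shown here).
rng = np.random.default_rng(0)
vocab = ["chat", "maison", "pomme"]
text_embs = [rng.normal(size=512) for _ in vocab]

# Simulate a disfluent utterance of "maison" as a perturbed prompt embedding.
audio_emb = text_embs[1] + 0.1 * rng.normal(size=512)

print(match_word(audio_emb, text_embs, vocab))  # prints "maison"
```

Because recognition is a nearest-prompt search rather than exact transcription, a mispronounced utterance can still land closer to the intended word's prompt than to any competitor, which is the robustness property the approach exploits.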