🤖 AI Summary
Individuals with speech disorders face severe human-computer interaction challenges: their speech is often dysfluent, and automatic speech recognition (ASR) systems model nonverbal cues poorly. To address this, we propose the first gesture-aware zero-shot ASR framework tailored to this population, moving beyond conventional unimodal ASR by integrating visual gestural cues with fragmented speech for cross-modal semantic understanding. Methodologically, we introduce hand-keypoint modeling into zero-shot ASR for the first time, combining a multimodal large language model with disentangled speech–vision feature representations to achieve cross-modal semantic alignment without paired multimodal training data, while also improving interpretability. Experiments demonstrate substantial improvements, including a 37.2% increase in keyword recall under low-resource patient speech conditions, markedly boosting the robustness and practical utility of semantic understanding.
📝 Abstract
Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, little attention has been paid to integrating non-verbal communication methods, such as gestures, on which individuals with language disorders substantially rely to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system that uses a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experimental results and analyses show that including gesture information significantly enhances semantic understanding. This study can inform the development of effective communication technologies designed specifically to meet the unique needs of individuals with language impairments.
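To make the core idea concrete, here is a minimal, hypothetical sketch of how gesture cues and a fragmented transcript could be fused into a single zero-shot prompt for a large language model, with no paired multimodal training data. All function names, the keypoint indexing (wrist at index 0, index fingertip at index 8, following the common 21-landmark hand convention), and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
def describe_gesture(keypoints):
    """Map 21 hand keypoints (x, y) to a coarse textual cue.
    Heuristic assumption: in image coordinates (smaller y = higher),
    an index fingertip above the wrist reads as 'pointing upward'."""
    wrist_y = keypoints[0][1]       # landmark 0: wrist
    index_tip_y = keypoints[8][1]   # landmark 8: index fingertip
    if index_tip_y < wrist_y:
        return "pointing upward"
    return "hand lowered"


def build_zero_shot_prompt(fragmented_transcript, keypoints):
    """Fuse a dysfluent ASR transcript with a gesture description into
    one prompt, so an LLM can infer intent without multimodal fine-tuning."""
    gesture = describe_gesture(keypoints)
    return (
        "The speaker has a language disorder. Their partial utterance was: "
        f"'{fragmented_transcript}'. Their hand gesture suggests: {gesture}. "
        "Infer the most likely intended meaning."
    )


# Toy example: 21 keypoints, with the index fingertip raised above the wrist.
kp = [(0.5, 0.9)] * 21
kp[8] = (0.5, 0.2)
prompt = build_zero_shot_prompt("I... want... up...", kp)
print(prompt)
```

In a real system, the keypoints would come from a hand-pose estimator on video frames and the transcript from an ASR front end; the point of the sketch is only that the fusion step can be expressed as zero-shot prompting over disentangled speech and vision representations.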