🤖 AI Summary
This study addresses the significant performance degradation of automatic speech recognition (ASR) systems for older adults with cognitive impairment, a degradation that limits their ability to use voice-enabled smart assistants. We recruited 83 older participants across varying cognitive states and collected read-aloud voice commands transcribed by Amazon Alexa's ASR system. Acoustic features, including vocal intensity, voice quality, and pause ratio, were modeled to analyze their relationship with ASR error rates. This work is the first to systematically reveal a strong association between these acoustic characteristics and ASR accuracy in individuals with cognitive decline, demonstrating that participants with dementia exhibit significantly higher ASR error rates and that the identified acoustic features effectively predict transcription accuracy. These findings offer both empirical evidence and a novel direction for designing cognition-aware AgeTech interfaces that adapt to users' cognitive abilities.
📝 Abstract
Millions of people live with cognitive impairment from Alzheimer's disease and related dementias (ADRD). Voice-enabled smart home systems offer promise for supporting their daily living but rely on automatic speech recognition (ASR) to transcribe users' speech to text. Prior work has shown reduced ASR performance for adults with cognitive impairment; however, the acoustic factors underlying these disparities remain poorly understood. This paper evaluates ASR performance for 83 older adults across cognitive groups (cognitively normal, mild cognitive impairment, dementia) reading commands to a voice assistant (Amazon Alexa). Results show that ASR errors are significantly higher for individuals with dementia, revealing a critical usability gap. To better understand these disparities, we conducted an acoustic analysis of speech features and found that a speaker's intensity, voice quality, and pause ratio predicted ASR accuracy. Based on these findings, we outline HCI design implications for AgeTech and voice interfaces, including speaker-personalized ASR, human-in-the-loop correction of ASR transcripts, and interaction-level personalization to support ability-based adaptation.