🤖 AI Summary
This study proposes VoxCog, a novel framework for automatic cross-lingual classification of cognitive impairments, such as Alzheimer's disease (AD) and mild cognitive impairment (MCI), using only speech signals, thereby eliminating reliance on textual transcripts or other modalities. Inspired by dialect identification, the approach treats pronunciation anomalies in patient speech (e.g., slowed speaking rate, prolonged syllables) as phonetic variations akin to dialectal ones. The system builds an end-to-end multilingual classification architecture by initializing a dialect classifier on top of a pretrained speech foundation model. Evaluated on the ADReSS 2020 and ADReSSo 2021 test sets, VoxCog achieves accuracies of 87.5% and 85.9%, respectively, outperforming existing methods that depend on multimodal inputs or large language models. This work demonstrates efficient, cross-lingual cognitive impairment detection from speech alone, without text or other modalities.
📝 Abstract
In this work, we present a novel perspective on cognitive impairment classification from speech by integrating speech foundation models that explicitly recognize dialects. Our motivation is the observation that individuals with Alzheimer's Disease (AD) or mild cognitive impairment (MCI) often produce measurable speech characteristics, such as slower articulation rate and lengthened sounds, in a manner similar to dialectal phonetic variation. Building on this idea, we introduce VoxCog, an end-to-end framework that uses pre-trained dialect models to detect AD or MCI without relying on additional modalities such as text or images. Through experiments on multiple multilingual datasets for AD and MCI detection, we demonstrate that initializing the model with a dialect classifier on top of a speech foundation model consistently improves predictive performance for AD and MCI. Our trained models match, and often exceed, previous approaches that ensemble several computational methods across different signal modalities. In particular, our end-to-end speech-based model achieves 87.5% and 85.9% accuracy on the ADReSS 2020 and ADReSSo 2021 challenge test sets, respectively, outperforming existing solutions that use multimodal ensembles or LLMs.
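The paper itself does not include code, but the initialization strategy it describes can be illustrated with a minimal PyTorch sketch, assuming a frame-level speech encoder and a linear classification head (all class and variable names here are hypothetical, not the authors' implementation): a classifier is first built for dialect identification, and its encoder is then reused to initialize the AD/MCI detector, with only the output head replaced.

```python
# Hedged sketch of dialect-classifier initialization for AD/MCI detection.
# ToyEncoder stands in for a real pretrained speech foundation model.
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Illustrative encoder: splits raw audio into frames and projects them."""

    def __init__(self, frame: int = 160, hidden: int = 16):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, hidden)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        batch, n_samples = wav.shape
        usable = n_samples - n_samples % self.frame
        frames = wav[:, :usable].reshape(batch, -1, self.frame)
        return self.proj(frames)  # (batch, n_frames, hidden)


class SpeechClassifier(nn.Module):
    """Encoder plus mean-pooled linear head, used for both tasks."""

    def __init__(self, encoder: nn.Module, hidden: int, n_classes: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(wav)             # (batch, n_frames, hidden)
        return self.head(feats.mean(dim=1))   # (batch, n_classes)


# Step 1: a dialect-identification model (pretraining on dialect data omitted).
dialect_model = SpeechClassifier(ToyEncoder(), hidden=16, n_classes=5)

# Step 2: initialize the AD-vs-control model from the dialect-tuned encoder;
# only the output head is new, then the whole model is fine-tuned end to end.
ad_model = SpeechClassifier(dialect_model.encoder, hidden=16, n_classes=2)

wav = torch.randn(4, 16000)  # a batch of 1-second 16 kHz waveforms
logits = ad_model(wav)       # shape (4, 2)
```

In this reading, the dialect task supplies an encoder already sensitive to phonetic variation, which the abstract suggests is why the initialization consistently helps AD/MCI prediction.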