Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses significant performance disparities across gender, age, and depression-status subgroups in existing speech-based cognitive impairment detection models, which compromise fairness and generalizability. It systematically evaluates subgroup performance of traditional acoustic features (MFCCs, eGeMAPS) and multi-layer embeddings from Wav2Vec 2.0 on cognitive impairment and depression classification tasks, revealing, for the first time, bias issues in self-supervised speech representations for clinical applications. Fairness is quantified using unweighted average recall (UAR), AUC, and the specificity gap (Δ_spec). While high-level Wav2Vec 2.0 embeddings achieve a UAR of 80.6% in cognitive impairment detection, they exhibit substantial specificity gaps of up to 18% for females and 15% for younger individuals, alongside limited depression detection performance and poor cross-task generalization. The work introduces metrics for quantifying representational bias and underscores the necessity of incorporating fairness evaluation in clinical speech models.

📝 Abstract
Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6%), but exhibit performance disparities; specifically, females and younger participants demonstrate lower discriminative power (AUC: 0.769 and 0.746, respectively) and substantial specificity disparities (Δ_spec up to 18% and 15%, respectively), leading to a higher risk of misclassification than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low- and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly in light of the demographic and clinical heterogeneity of real-world settings.
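The fairness metrics named in the abstract (UAR and the specificity gap Δ_spec) can be computed directly from subgroup labels. Below is a minimal sketch of both, assuming binary labels with 0 as the negative (healthy) class; the function names and the synthetic example data are illustrative, not taken from the paper.

```python
import numpy as np

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

def specificity(y_true, y_pred, negative_label=0):
    """Fraction of true negatives classified as negative."""
    neg = y_true == negative_label
    return float(np.mean(y_pred[neg] == negative_label))

def specificity_gap(y_true, y_pred, groups):
    """Delta_spec: largest difference in specificity across subgroups."""
    specs = [specificity(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)]
    return float(max(specs) - min(specs))

# Illustrative data: two subgroups with unequal false-positive rates.
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
groups = np.array(["f", "f", "f", "f", "m", "m", "m", "m"])

print(uar(y_true, y_pred))                    # 0.75
print(specificity_gap(y_true, y_pred, groups))  # 0.5
```

A nonzero Δ_spec flags exactly the kind of disparity the paper reports: one subgroup absorbs more false positives even when aggregate UAR looks strong.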
Problem

Research questions and friction points this paper is trying to address.

bias
fairness
cognitive impairment detection
acoustic representations
performance disparities
Innovation

Methods, ideas, or system contributions that make the work stand out.

fairness
self-supervised learning
acoustic representations
bias analysis
Wav2Vec 2.0