🤖 AI Summary
This study investigates whether natural conversational speech that was not recorded for diagnostic purposes, specifically turn-taking (TT) interactions, can support automated Parkinson's disease (PD) detection. This challenges the prevailing reliance on diagnostic speech tasks (e.g., sustained vowels or read speech).
Method: We employ acoustic feature extraction, binary classification modeling, and cross-dataset transfer evaluation. To mitigate biases, we apply resampling and data augmentation to balance gender and diagnostic status distributions.
Contribution/Results: Our systematic evaluation shows that models trained solely on TT data achieve PD classification performance comparable to models trained on the diagnostic-oriented dataset PC-GITA. Cross-dataset evaluation reveals an asymmetry: models trained on PC-GITA generalize poorly to TT, whereas models trained on TT perform better on PC-GITA. The high variability across cross-validation folds is attributed mainly to large differences in performance between individual speakers. These findings provide empirical support for unobtrusive PD screening based on everyday speech.
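The balancing step can be illustrated with a minimal sketch. It assumes balancing is done by downsampling speakers so that every (gender, status) group has the same size; the summary mentions resampling and augmentation without giving details, so this is one possible scheme rather than the authors' procedure. The function name `balance_speakers` and the record keys `id`, `gender`, and `status` are hypothetical.

```python
import random
from collections import defaultdict


def balance_speakers(speakers, seed=0):
    """Downsample so each (gender, status) group keeps the same number of speakers.

    `speakers` is a list of dicts with the hypothetical keys
    "id", "gender", and "status" (e.g. "PD" or "HC").
    """
    rng = random.Random(seed)

    # Group speakers by their (gender, status) combination.
    groups = defaultdict(list)
    for spk in speakers:
        groups[(spk["gender"], spk["status"])].append(spk)

    # Every group is reduced to the size of the smallest one.
    target = min(len(members) for members in groups.values())

    balanced = []
    for key in sorted(groups):
        balanced.extend(rng.sample(groups[key], target))
    return balanced


# Toy cohort with unequal group sizes (assumed data, not from the paper).
speakers = (
    [{"id": f"fp{i}", "gender": "F", "status": "PD"} for i in range(5)]
    + [{"id": f"fh{i}", "gender": "F", "status": "HC"} for i in range(3)]
    + [{"id": f"mp{i}", "gender": "M", "status": "PD"} for i in range(8)]
    + [{"id": f"mh{i}", "gender": "M", "status": "HC"} for i in range(4)]
)
balanced = balance_speakers(speakers)
```

Balancing at the speaker level, before any recordings are split into folds, keeps all recordings of one person together and avoids leaking speaker identity between training and test data.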
📝 Abstract
Speech-based Parkinson's disease (PD) detection has gained attention for its automated, cost-effective, and non-intrusive nature. Because research studies usually rely on data from diagnostic-oriented speech tasks, this work explores the feasibility of detecting PD from speech data not originally intended for diagnostic purposes, using the Turn-Taking (TT) dataset. Our findings indicate that TT can be as useful as diagnostic-oriented PD datasets like PC-GITA. We also investigate which specific dataset characteristics impact PD classification performance. The results show that concatenating audio recordings and balancing the distributions of participants' gender and diagnostic status can be beneficial. Cross-dataset evaluation reveals that models trained on PC-GITA generalize poorly to TT, whereas models trained on TT perform better on PC-GITA. Furthermore, we provide insights into the high variability across folds, which is mainly due to large differences in individual speaker performance.
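The link between fold-level variability and individual speakers can be made concrete with a small sketch. It assumes speaker-wise folds (each speaker appears in exactly one fold) and per-recording predictions; the record keys `speaker`, `fold`, `label`, and `pred`, the helper names, and the toy data are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy per value of `key` (e.g. per fold or per speaker)."""
    hits = defaultdict(list)
    for rec in records:
        hits[rec[key]].append(rec["pred"] == rec["label"])
    return {k: sum(v) / len(v) for k, v in hits.items()}


def make_records(speaker, fold, label, preds):
    """Build one record per recording of a speaker (toy helper)."""
    return [
        {"speaker": speaker, "fold": fold, "label": label, "pred": p}
        for p in preds
    ]


# Toy predictions: label 1 = PD, label 0 = healthy control.
records = (
    make_records("s1", 0, 1, [1, 1])    # both recordings classified correctly
    + make_records("s2", 0, 0, [0, 0])  # both correct
    + make_records("s3", 1, 1, [0, 0])  # both recordings misclassified
    + make_records("s4", 1, 0, [0, 0])  # both correct
)

fold_acc = accuracy_by(records, "fold")
speaker_acc = accuracy_by(records, "speaker")
```

In this toy case fold 1 scores lower than fold 0 only because it contains speaker `s3`, whose recordings are all misclassified. Comparing per-fold and per-speaker accuracies in this way is one simple check of whether fold instability traces back to a few hard speakers.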