🤖 AI Summary
This study addresses the scarcity of speech data for early screening of Alzheimer’s disease and related dementias (ADRD). It proposes a detection framework that integrates speech-text analysis with synthetic data augmentation: (1) evaluation of ten Transformer models under three fine-tuning strategies, with the top model’s embeddings fused with 110 handcrafted linguistic features; (2) use of large language models, including the clinically tuned MedAlpaca-7B alongside LLaMA and GPT-4o, to generate label-conditioned synthetic speech transcripts; and (3) systematic evaluation of multimodal models (GPT-4o, Qwen-Omni, Phi-4) under zero-shot and fine-tuned settings. A key contribution is the empirical finding that the distributional similarity between synthetic and real data largely determines augmentation gains. The fused model achieves F1 = 83.3 (AUC = 89.5); augmenting training with twice as much MedAlpaca-generated synthetic data raises F1 to 85.7, and fine-tuning MedAlpaca as a classifier improves its F1 from 47.3 to 78.5. Together, these results demonstrate the feasibility of synthetic data augmentation and clinically tuned LLMs for low-resource ADRD screening.
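The fusion step pairs a transformer embedding of each transcript with its handcrafted feature vector. Below is a minimal sketch, assuming a BERT-style encoder, mean pooling, and a logistic-regression head; the paper's exact encoder, pooling, and classifier are not specified here, so treat every name as illustrative.

```python
# Hypothetical sketch of the embedding + linguistic-feature fusion classifier.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

def transcript_embedding(text: str) -> np.ndarray:
    """Mean-pooled last-hidden-state embedding of one transcript."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()      # (768,)

def fuse(texts, linguistic_features):
    """Concatenate transformer embeddings with the 110 handcrafted features.

    linguistic_features: (n, 110) array of precomputed lexical features.
    """
    embeds = np.stack([transcript_embedding(t) for t in texts])  # (n, 768)
    return np.hstack([embeds, linguistic_features])              # (n, 878)

# Usage sketch:
# X = fuse(train_texts, train_features)
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```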
📝 Abstract
Alzheimer's disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach for detecting early cognitive decline through linguistic markers.
The objective was to develop and evaluate a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech transcripts generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection.
Transcripts from the DementiaBank "cookie-theft" task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexically derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech transcripts, which were used to augment the training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings.
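To illustrate the label-conditioned generation step, here is a hedged sketch using a chat-completion API. The prompt wording and the choice of GPT-4o are assumptions, not the authors' exact setup, and the study also fine-tuned open models such as MedAlpaca-7B for this purpose.

```python
# Illustrative sketch of label-conditioned synthetic transcript generation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the study's actual conditioning prompt is not given here.
PROMPT = (
    "You are simulating a participant describing the 'cookie-theft' picture. "
    "Generate one spontaneous-speech transcript consistent with the label "
    "'{label}', where labels are 'ADRD' (cognitive-linguistic impairment) "
    "or 'control' (healthy)."
)

def synthesize(label: str, n: int = 10) -> list[str]:
    """Generate n synthetic transcripts conditioned on a diagnostic label."""
    transcripts = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT.format(label=label)}],
            temperature=1.0,  # encourage lexical diversity across samples
        )
        transcripts.append(resp.choices[0].message.content)
    return transcripts
```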
The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic-only and transformer-only baselines. Augmenting the training data with 2× MedAlpaca-7B synthetic transcripts increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 from 47.3 to 78.5). Current multimodal models performed worse (GPT-4o: F1 = 70.2; Qwen-Omni: F1 = 66.0). Performance gains tracked the distributional similarity between synthetic and real speech.
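Because gains tracked synthetic/real distributional similarity, a cheap similarity check can screen candidate augmentation sources before training. The TF-IDF mean-vector cosine below is an assumed stand-in, not the metric reported by the paper.

```python
# Assumed proxy for synthetic/real distributional similarity:
# cosine similarity between corpus-level mean TF-IDF vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def corpus_similarity(real: list[str], synthetic: list[str]) -> float:
    """Cosine similarity in [0, 1] of mean TF-IDF vectors of two corpora."""
    vec = TfidfVectorizer().fit(real + synthetic)  # shared vocabulary
    real_mean = np.asarray(vec.transform(real).mean(axis=0))
    synth_mean = np.asarray(vec.transform(synthetic).mean(axis=0))
    return float(cosine_similarity(real_mean, synth_mean)[0, 0])

# Higher scores indicate the synthetic corpus better matches the real one,
# the regime in which the study observed the largest F1 gains.
```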
Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.