🤖 AI Summary
To address the limited interpretability of pathological speech detection models, this paper applies a multimodal large language model (ChatGPT-4o) to the task, using a few-shot in-context learning framework for interpretable detection. Methodologically, it prompts the model with labeled speech examples and employs an ablation study of factors such as input type and system prompts to understand what drives the results. Experiments show promising detection performance, with the model also generating natural-language explanations for its decisions, improving interpretability and supporting clinical applicability. The core contribution lies in moving beyond conventional black-box models, pointing toward pathological speech analysis that combines competitive accuracy with clinical interpretability.
📝 Abstract
Automatic pathological speech detection approaches have shown promising results, gaining attention as potential diagnostic tools alongside costly traditional methods. While these approaches can achieve high accuracy, their lack of interpretability limits their applicability in clinical practice. In this paper, we investigate the use of multimodal Large Language Models (LLMs), specifically ChatGPT-4o, for automatic pathological speech detection in a few-shot in-context learning setting. Experimental results show that this approach not only delivers promising performance but also provides explanations for its decisions, enhancing model interpretability. To further understand its effectiveness, we conduct an ablation study to analyze the impact of different factors, such as input type and system prompts, on the final results. Our findings highlight the potential of multimodal LLMs for further exploration and advancement in automatic pathological speech detection.
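The few-shot in-context learning setting described above can be sketched as assembling a prompt in which a handful of labeled speech examples precede the unlabeled query. This is a minimal illustration, not the paper's actual pipeline: the system prompt wording, the labels, and the use of the OpenAI Chat Completions audio format are all assumptions for the sketch.

```python
# Minimal sketch of building a few-shot, in-context prompt for
# pathological speech detection with a multimodal LLM.
# The system prompt, label strings, and message format below are
# illustrative assumptions, not the paper's actual prompts or data.
import base64


SYSTEM_PROMPT = (
    "You are a clinical speech analyst. Given a speech recording, "
    "classify it as 'healthy' or 'pathological' and explain your reasoning."
)


def audio_part(wav_bytes: bytes) -> dict:
    """Wrap raw WAV bytes as an input_audio content part
    (OpenAI Chat Completions multimodal format)."""
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(wav_bytes).decode("ascii"),
            "format": "wav",
        },
    }


def build_messages(few_shot: list[tuple[bytes, str]], query_wav: bytes) -> list[dict]:
    """Interleave labeled (audio, label) examples before the unlabeled query,
    so the model can infer the task in context."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for wav_bytes, label in few_shot:
        messages.append({"role": "user", "content": [audio_part(wav_bytes)]})
        messages.append({"role": "assistant", "content": label})
    # The final, unlabeled query the model must classify and explain.
    messages.append({"role": "user", "content": [audio_part(query_wav)]})
    return messages
```

The resulting message list could then be sent to an audio-capable model (e.g. via `client.chat.completions.create(model="gpt-4o-audio-preview", messages=messages)`), with the free-text response providing both the predicted label and the model's explanation.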