🤖 AI Summary
This study addresses the absence of a Chinese psychiatric evaluation benchmark that simultaneously simulates realistic patients, provides clinically validated labels, and supports dynamic multi-turn diagnostic interviews. These gaps hinder the application of large language models (LLMs) to mental health diagnostics. To bridge them, the authors propose LingxiDiagBench, the first Chinese psychiatric assessment framework integrating real-world clinical distributions, multi-turn interactive dialogues, and expert-validated labels. Using a multi-agent simulation approach, they construct LingxiDiag-16K, a dataset of 16,000 synthetic consultation dialogues aligned with electronic medical record (EMR) distributions, and employ LLM-as-a-Judge to evaluate interview quality. Experiments show that although leading models reach up to 92.3% accuracy on binary depression–anxiety classification, performance drops substantially on depression–anxiety comorbidity recognition (43.0%) and 12-class differential diagnosis (28.5%). Dynamic interviewing also often underperforms static assessment, which points to critical limitations in complex diagnostic scenarios.
📝 Abstract
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
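
The comparisons behind findings (2) and (3) can be illustrated with a minimal Python sketch. It is not the LingxiDiagBench evaluation code: the function names, the toy labels (ICD-10 codes F32 and F41, used here only as example classes), and the judge scores are all hypothetical, and Pearson correlation is assumed as the correlation measure since the abstract does not say which one is used.

```python
from statistics import mean


def accuracy(predicted, gold):
    """Fraction of cases where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return mean([1.0 if p == g else 0.0 for p, g in zip(predicted, gold)])


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data, not taken from LingxiDiag-16K.
gold = ["F32", "F41", "F32", "F41"]          # clinician-verified labels
static_pred = ["F32", "F41", "F32", "F32"]   # diagnosis from a full transcript
dynamic_pred = ["F32", "F32", "F32", "F32"]  # diagnosis after the model's own interview
judge_scores = [4.0, 3.5, 4.5, 2.0]          # assumed per-dialogue judge ratings

static_acc = accuracy(static_pred, gold)
dynamic_acc = accuracy(dynamic_pred, gold)

# Correlate consultation quality with per-case correctness in the dynamic setting.
dynamic_correct = [1.0 if p == g else 0.0 for p, g in zip(dynamic_pred, gold)]
quality_vs_correct = pearson(judge_scores, dynamic_correct)

print(f"static accuracy:  {static_acc:.2f}")
print(f"dynamic accuracy: {dynamic_acc:.2f}")
print(f"judge score vs. correctness (Pearson r): {quality_vs_correct:.2f}")
```

In this toy setup the gap between `static_acc` and `dynamic_acc` corresponds to the static-versus-dynamic comparison, and `quality_vs_correct` corresponds to the relationship between judged consultation quality and diagnostic accuracy.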