LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the absence of a Chinese psychiatric evaluation benchmark capable of simultaneously simulating realistic patients, providing clinically validated labels, and supporting dynamic multi-turn diagnostic interviews—limitations that hinder the application of large language models (LLMs) in mental health diagnostics. To bridge this gap, we propose the first Chinese psychiatric assessment framework integrating real-world clinical distributions, multi-turn interactive dialogues, and expert-validated labels. Leveraging a multi-agent simulation approach, we construct LingxiDiag-16K, a dataset comprising 16,000 consultation dialogues aligned with electronic medical record distributions, and employ LLM-as-a-Judge to evaluate interview quality. Experiments reveal that while leading models achieve 92.3% accuracy in binary depression–anxiety classification, performance drops substantially in comorbidity recognition (43.0%) and 12-class differential diagnosis (28.5%), with dynamic interviewing consistently underperforming static assessments—highlighting critical limitations in complex diagnostic scenarios.

📝 Abstract
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
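The dynamic-consultation setting described in the abstract (a model interviews a simulated patient over multiple turns, then commits to a diagnosis) can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the `DoctorAgent` and `SimulatedPatient` classes, the screening topics, and the toy diagnostic rule are all invented stand-ins for what would be LLM-backed agents and ICD-10 labels in the actual benchmark.

```python
# Toy sketch of a dynamic multi-turn consultation protocol: a "doctor" agent
# questions a simulated "patient" agent, then makes a differential diagnosis.
# In the real benchmark both roles would be LLM-driven; here they are
# rule-based so the example is self-contained and runnable.
from dataclasses import dataclass, field

@dataclass
class SimulatedPatient:
    """Answers questions from a fixed case record (stand-in for an EMR-aligned profile)."""
    symptoms: set

    def answer(self, topic: str) -> str:
        # Information is revealed only when asked for, which is what makes
        # the interviewing model's questioning strategy matter.
        return "yes" if topic in self.symptoms else "no"

@dataclass
class DoctorAgent:
    transcript: list = field(default_factory=list)

    def interview(self, patient: SimulatedPatient, max_turns: int = 3) -> str:
        topics = ["sadness", "worry", "insomnia"]  # scripted screening questions
        positives = set()
        for topic in topics[:max_turns]:
            question = f"Do you often experience {topic}?"
            reply = patient.answer(topic)
            self.transcript.append((question, reply))
            if reply == "yes":
                positives.add(topic)
        # Toy differential rule: comorbidity requires both core symptoms.
        if {"sadness", "worry"} <= positives:
            return "comorbid depression-anxiety"
        if "sadness" in positives:
            return "depression"
        if "worry" in positives:
            return "anxiety"
        return "no diagnosis"

def evaluate(cases: list) -> float:
    """Fraction of cases where the dynamic interview reaches the gold label."""
    correct = 0
    for symptoms, gold in cases:
        doctor = DoctorAgent()
        if doctor.interview(SimulatedPatient(symptoms)) == gold:
            correct += 1
    return correct / len(cases)

cases = [
    ({"sadness", "insomnia"}, "depression"),
    ({"worry"}, "anxiety"),
    ({"sadness", "worry"}, "comorbid depression-anxiety"),
]
print(evaluate(cases))  # 1.0 on this toy set
```

The gap the paper reports between static and dynamic evaluation corresponds, in this sketch, to errors introduced inside `interview`: if the questioning strategy never elicits a decisive symptom, the final diagnosis fails even when the downstream classification rule is sound.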
Problem

Research questions and friction points this paper is trying to address.

psychiatric diagnosis
benchmark
LLM evaluation
multi-turn consultation
mental health
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent benchmark
LLM evaluation
psychiatric diagnosis
synthetic dialogue dataset
dynamic consultation
👥 Authors
Shihao Xu
EverMind AI Inc., Tianqiao and Chrissy Chen Institute
Tiancheng Zhou
EverMind AI Inc.
Jiatong Ma
EverMind AI Inc.
Yanli Ding
Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine
Yiming Yan
Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine
Ming Xiao
Professor, KTH (Network and Channel Coding, Wireless Communications, Machine Learning)
Guoyi Li
EverMind AI Inc.
Haiyang Geng
University of Groningen, University Medical Center Groningen, the Netherlands (Anxiety (Stress), Executive Control, Cognitive and Affective Neuroscience, Dynamic Model of Brain Networks, Hallucination in Schi)
Yunyun Han
Huazhong University of Science and Technology (Neuroscience)
Jianhua Chen
Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine
Yafeng Deng
Baidu (Large Language Model, Long Term Memory, Continue Learning)