🤖 AI Summary
Current medical large language models (LLMs) lack standardized clinical benchmarks that jointly assess safety and effectiveness. Method: We introduce the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), built from 30 expert-consensus criteria and 2,069 real-world clinical Q&A items spanning 26 clinical departments. It combines weighted consequence measures and dedicated high-risk scenario assessment, integrating multi-round expert review with open-ended question design. Contribution/Results: Evaluation of six LLMs reveals moderate overall performance (safety 54.7%, effectiveness 62.3%), with domain-specific medical models consistently outperforming general-purpose ones (top scores: safety 0.912, effectiveness 0.861); performance degrades markedly in high-risk scenarios. CSEDB establishes a reproducible, multidisciplinary, risk-aware evaluation paradigm for the clinical deployment of medical LLMs.
📝 Abstract
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria that cover key areas such as critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%; safety 54.7%; effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analysis, risk-exposure identification, and targeted improvement across scenarios, and may help promote safer, more effective deployment of LLMs in healthcare settings.
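The abstract does not spell out how the weighted consequence measures are aggregated into track scores. A minimal sketch of one plausible aggregation, a consequence-weighted mean per track with an optional high-risk subset, is shown below; the item structure, weights, and scores are all illustrative assumptions, not the paper's actual scoring rule:

```python
from dataclasses import dataclass

@dataclass
class Item:
    track: str        # "safety" or "effectiveness" (dual-track design)
    weight: float     # hypothetical consequence weight: higher = worse if failed
    score: float      # expert-assigned score in [0, 1] for the model's answer
    high_risk: bool   # whether the item is a high-risk scenario

def track_score(items, track, high_risk_only=False):
    """Consequence-weighted mean score for one track.

    Optionally restrict to high-risk items to mirror the benchmark's
    separate high-risk scenario assessment.
    """
    sel = [i for i in items
           if i.track == track and (i.high_risk or not high_risk_only)]
    total_w = sum(i.weight for i in sel)
    if total_w == 0:
        return 0.0
    return sum(i.weight * i.score for i in sel) / total_w

# Toy example: severe-consequence failures pull the weighted score down
# more than the unweighted mean would suggest.
items = [
    Item("safety", 3.0, 0.5, True),
    Item("safety", 1.0, 0.8, False),
    Item("effectiveness", 2.0, 0.7, False),
]
print(round(track_score(items, "safety"), 3))                       # 0.575
print(round(track_score(items, "safety", high_risk_only=True), 3))  # 0.5
```

Weighting by consequence severity is what lets a benchmark like this penalize a dangerous error (e.g. a missed contraindication) more heavily than a minor stylistic lapse, which a plain accuracy average cannot do.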