A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

📅 2025-07-31
🤖 AI Summary
Problem: Medical large language models (LLMs) lack a standardized clinical benchmark that assesses safety and effectiveness together. Method: We introduce the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), built from 30 expert-consensus criteria and 2,069 open-ended clinical Q&A items that span 26 medical specialties and simulate real-world scenarios. It incorporates a weighted consequence metric and a dedicated high-risk scenario assessment, combining multi-round expert adjudication with open-ended question design. Contribution/Results: Across the six LLMs tested, overall performance was moderate (safety 54.7%, effectiveness 62.3%) and fell by 13.3% in high-risk scenarios (p < 0.0001). Domain-specific medical models consistently outperformed general-purpose ones, with top scores of 0.912 for safety and 0.861 for effectiveness. CSEDB offers a reproducible, multidisciplinary, and risk-aware evaluation approach for the clinical deployment of medical LLMs.

📝 Abstract
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. It comprises 30 criteria with weighted consequence measures, covering key areas such as critical illness recognition, guideline adherence, and medication safety. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with higher top scores in safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating the clinical application of medical LLMs, supporting comparative analysis, identification of risk exposure, and targeted improvement across scenarios. They may also promote safer and more effective deployment of LLMs in healthcare environments.
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety and effectiveness of medical LLMs in clinical settings
Developing a benchmark for clinical decision support validation
Identifying performance gaps in high-risk medical scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB)
Created 2,069 expert-reviewed clinical Q&A items
Tested six LLMs with weighted consequence measures
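
The listing does not give the CSEDB scoring formula, so the following is only a minimal sketch of how a dual-track score with weighted consequence measures could be computed. It assumes each criterion belongs to one track, carries an integer consequence weight, and is judged as met or not met; the class name, function name, weights, and sample results are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    track: str    # "safety" or "effectiveness"
    weight: int   # hypothetical consequence weight (larger = more severe)
    passed: bool  # whether the model response met the criterion


def track_score(results, track):
    """Weighted fraction of criteria met within one track (0.0 to 1.0)."""
    selected = [r for r in results if r.track == track]
    total = sum(r.weight for r in selected)
    if total == 0:
        raise ValueError(f"no weighted criteria for track {track!r}")
    earned = sum(r.weight for r in selected if r.passed)
    return earned / total


# Illustrative judgments for one model response (made-up values).
results = [
    CriterionResult("safety", 5, True),
    CriterionResult("safety", 3, False),
    CriterionResult("safety", 2, True),
    CriterionResult("effectiveness", 4, True),
    CriterionResult("effectiveness", 1, False),
]

safety = track_score(results, "safety")                # 7 of 10 weight points
effectiveness = track_score(results, "effectiveness")  # 4 of 5 weight points
print(f"safety={safety:.3f} effectiveness={effectiveness:.3f}")
```

Weighting by consequence means that missing a high-severity criterion lowers a track score more than missing a minor one, which is one plausible reading of "weighted consequence measures".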
Shirui Wang
Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China
Zhihui Tang
Peking University School of Stomatology, Haidian, Beijing, China
Huaxia Yang
Department of Rheumatology and Clinical Immunology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Qiuhong Gong
Center of Endocrinology, National Center of Cardiology & Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Tiantian Gu
Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China
Hongyang Ma
Peking University School of Stomatology, Haidian, Beijing, China
Yongxin Wang
Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China
Wubin Sun
Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China
Zeliang Lian
Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China
Kehang Mao
Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China
Yinan Jiang
Department of Psychological Medicine, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Zhicheng Huang
Department of Thoracic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Lingyun Ma
Department of Respiratory and Critical Care Medicine, the 8th Medical Center of PLA General Hospital, Beijing, China
Wenjie Shen
Department of Obstetrics & Gynecology, the Fourth Medical Center of PLA General Hospital, Beijing, China
Yajie Ji
Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Shanghai, China
Yunhui Tan
Department of Urology, The Second Affiliated Hospital of Harbin Medical University, Heilongjiang Province, China
Chunbo Wang
Department of Radiation Oncology, Harbin Medical University Cancer Hospital, Harbin, Heilongjiang Province, China
Yunlu Gao
Department of Dermatology, Shanghai Skin Disease Hospital, Tongji University School of Medicine, Shanghai, China
Qianling Ye
Department of Oncology, East Hospital Affiliated to Tongji University, Tongji University School of Medicine, Tongji University, Shanghai, China
Rui Lin
Mingyu Chen
Master's, University of Science and Technology of China
Quantum compilation
Lijuan Niu
Zhihao Wang
Peking University
Robotics, Reinforcement Learning
Peng Yu
Mengran Lang