🤖 AI Summary
Current medical large language models (LLMs) lack standardized clinical benchmarks that jointly assess safety and effectiveness. Method: We introduce the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), built from 30 expert-consensus criteria and 2,069 real-world clinical Q&A items spanning 26 clinical departments. It combines weighted consequence measures and dedicated high-risk scenario assessment, integrating multi-round expert review with open-ended question design. Contribution/Results: Evaluation of six LLMs reveals moderate overall performance (safety 54.7%, effectiveness 62.3%), with domain-specific medical models consistently outperforming general-purpose ones (top scores: safety 0.912, effectiveness 0.861); performance degrades markedly in high-risk scenarios. CSEDB establishes a reproducible, multidisciplinary, risk-aware evaluation paradigm for the clinical deployment of medical LLMs.
📝 Abstract
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria that cover key areas such as critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%; safety 54.7%; effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analysis, risk-exposure identification, and targeted improvement across scenarios, and may help promote safer, more effective deployment of LLMs in healthcare settings.
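The abstract does not spell out how the weighted consequence measures are aggregated into track scores. A minimal sketch of one plausible aggregation, a consequence-weighted mean per track with an optional high-risk subset, is shown below; the item structure, weights, and scores are all illustrative assumptions, not the paper's actual scoring rule:

```python
from dataclasses import dataclass

@dataclass
class Item:
    track: str        # "safety" or "effectiveness" (dual-track design)
    weight: float     # hypothetical consequence weight: higher = worse if failed
    score: float      # expert-assigned score in [0, 1] for the model's answer
    high_risk: bool   # whether the item is a high-risk scenario

def track_score(items, track, high_risk_only=False):
    """Consequence-weighted mean score for one track.

    Optionally restrict to high-risk items to mirror the benchmark's
    separate high-risk scenario assessment.
    """
    sel = [i for i in items
           if i.track == track and (i.high_risk or not high_risk_only)]
    total_w = sum(i.weight for i in sel)
    if total_w == 0:
        return 0.0
    return sum(i.weight * i.score for i in sel) / total_w

# Toy example: severe-consequence failures pull the weighted score down
# more than the unweighted mean would suggest.
items = [
    Item("safety", 3.0, 0.5, True),
    Item("safety", 1.0, 0.8, False),
    Item("effectiveness", 2.0, 0.7, False),
]
print(round(track_score(items, "safety"), 3))                       # 0.575
print(round(track_score(items, "safety", high_risk_only=True), 3))  # 0.5
```

Weighting by consequence severity is what lets a benchmark like this penalize a dangerous error (e.g. a missed contraindication) more heavily than a minor stylistic lapse, which a plain accuracy average cannot do.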