First, do NOHARM: towards clinically safe large language models

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack rigorous clinical safety evaluation in medical consultation, particularly in real-world referral scenarios. Method: We introduce NOHARM, the first clinical harm benchmark grounded in 100 authentic, cross-specialty (10 domains) referral cases, accompanied by 4,249 expert-annotated clinical management options and 12,747 fine-grained annotations, enabling systematic quantification of harm frequency and severity in LLM-generated recommendations. Results: Up to 22.2% of cases entail severe harm risk, with 76.6% of errors attributable to omission; the best-performing models surpass generalist physicians in safety by 9.7%; multi-agent collaboration further reduces harm incidence by 8.0%. This work establishes "clinical safety" as a distinct, measurable evaluation dimension, exposes a critical gap between existing benchmarks and real-world clinical risk, and provides a reproducible, scalable assessment framework for the safe deployment of medical LLMs.
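To make the measurement concrete, here is a minimal sketch of how severe-harm frequency and the omission share could be aggregated from per-option expert annotations. The schema below (field names, the 0-3 severity scale, and the error-type labels) is assumed for illustration; it is not the paper's published data format.

```python
from dataclasses import dataclass

@dataclass
class OptionAnnotation:
    """One expert judgment on one model-handled management option
    (hypothetical schema; the real benchmark has 12,747 annotations)."""
    case_id: str
    severity: int    # assumed grading: 0 = no harm ... 3 = severe harm
    error_type: str  # "omission" (indicated option not recommended) or
                     # "commission" (harmful option recommended)

def harm_summary(annotations: list[OptionAnnotation]) -> dict[str, float]:
    """Aggregate to the two headline statistics: the fraction of cases
    with any severe-harm option, and the omission share among errors."""
    cases = {a.case_id for a in annotations}
    severe_cases = {a.case_id for a in annotations if a.severity == 3}
    errors = [a for a in annotations if a.severity > 0]
    omissions = [a for a in errors if a.error_type == "omission"]
    return {
        "severe_harm_rate": len(severe_cases) / len(cases),
        "omission_share": len(omissions) / len(errors) if errors else 0.0,
    }
```

Counting severe harm at the case level (any severity-3 annotation within a case) matches how the 22.2% figure is phrased; the paper's actual grading rubric may differ.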

📝 Abstract
Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary-care-to-specialist consultation cases to measure harm frequency and severity from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, severe harm occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harms of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.
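Every rate in the abstract carries a 95% CI. One standard way to obtain such intervals is a nonparametric percentile bootstrap over cases; the sketch below is a generic illustration under that assumption, not the authors' published procedure (their narrower intervals, e.g. 21.6-22.8%, suggest pooling over repeated runs or models rather than a single 100-case resample).

```python
import random

def bootstrap_ci(values: list[float], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a mean of per-case indicators
    (1.0 = severe harm in that case, 0.0 = none). Generic sketch."""
    rng = random.Random(seed)
    n = len(values)
    # Resample cases with replacement, record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return (means[int((alpha / 2) * n_boot)],
            means[int((1 - alpha / 2) * n_boot) - 1])

# Example: 100 cases, 22 of which showed severe harm.
cases = [1.0] * 22 + [0.0] * 78
print(bootstrap_ci(cases))  # roughly (0.14, 0.30) at n = 100
```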
Problem

Research questions and friction points this paper is trying to address.

Assessing harm frequency and severity in LLM medical advice
Evaluating clinical safety across multiple medical specialties
Identifying gaps between AI knowledge benchmarks and actual safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed NOHARM benchmark for clinical safety assessment
Used a multi-agent approach to reduce harmful medical advice (see the sketch after this list)
Measured harm frequency and severity across 31 LLMs
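On the multi-agent result referenced above: this summary does not spell out the collaboration protocol, so the following is a minimal sketch assuming one simple aggregation rule, a quorum vote over the management options proposed by independent model agents. The function name, threshold, and option strings are all hypothetical.

```python
from collections import Counter

def multi_agent_plan(agent_plans: list[set[str]],
                     quorum: float = 0.5) -> set[str]:
    """Keep only options endorsed by at least `quorum` of the agents.
    Hypothetical aggregation; not the paper's published protocol."""
    counts = Counter(opt for plan in agent_plans for opt in plan)
    threshold = quorum * len(agent_plans)
    return {opt for opt, c in counts.items() if c >= threshold}

# Example: three diverse agents; the vote keeps majority options and
# drops the single-agent outlier.
plans = [
    {"order_troponin", "start_aspirin", "urgent_cardiology_referral"},
    {"order_troponin", "urgent_cardiology_referral"},
    {"order_troponin", "start_aspirin", "order_d_dimer"},
]
print(sorted(multi_agent_plan(plans)))
# ['order_troponin', 'start_aspirin', 'urgent_cardiology_referral']
```

The quorum threshold trades off NOHARM's two error modes: a low threshold acts like a union and counters omissions (76.6% of observed errors), while a high threshold filters harmful commissions.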
Authors

David Wu
Harvard Combined Dermatology Program, Boston, MA, USA; Department of Dermatology, Mass General Brigham, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
Fateme Nateghi Haredasht
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Saloni Kumar Maharaj
Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Priyank Jain
Harvard Medical School, Boston, MA, USA; Department of Medicine, Cambridge Health Alliance, Cambridge, MA, USA
Jessica Tran
Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Matthew Gwiazdon
Beth Israel Deaconess Hospital–Plymouth, Plymouth, MA, USA
Arjun Rustagi
Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
Jenelle Jindal
Stanford University
Jacob M. Koshy
Harvard Medical School, Boston, MA, USA; Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
Vinay Kadiyala
Harvard Medical School, Boston, MA, USA; Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
Anup Agarwal
Harvard Medical School, Boston, MA, USA; Department of Medicine, Cambridge Health Alliance, Cambridge, MA, USA
Bassman Tappuni
Harvard Medical School, Boston, MA, USA; Division of Cardiology, Department of Medicine, Cambridge Health Alliance, Cambridge, MA, USA
Brianna French
Department of Cardiovascular Medicine, Summa Health System, Akron, OH, USA
Sirus Jesudasen
Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, University of Wisconsin-Madison, Madison, WI, USA
Christopher V. Cosgriff
Harvard Medical School, Boston, MA, USA; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA; Center for Immunology and Inflammatory Diseases, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
Rebanta Chakraborty
Harvard Medical School, Boston, MA, USA
Jillian Caldwell
Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Susan Ziolkowski
Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
David J. Iberri
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
Robert Diep
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
Rahul S. Dalal
Harvard Medical School, Boston, MA, USA
Kira L. Newman
Department of Neurology, Stanford University School of Medicine, Stanford, CA, USA
Kristin Galetta
Department of Neurology, Stanford University School of Medicine, Stanford, CA, USA
J. Carl Pallais
Harvard Medical School, Boston, MA, USA
Nancy Wei
Harvard Medical School, Boston, MA, USA