Safety and accuracy follow different scaling laws in clinical large language models

📅 2026-05-05

🤖 AI Summary
This work challenges the assumption that scaling clinical large language models automatically makes them safer: higher average accuracy can coexist with confident, high-risk, or evidence-contradicting errors that matter more for medical decision-making. The authors propose SaFE-Scale, a framework for measuring how safety changes with model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute, and instantiate it with RadSaFE-200, a radiology safety benchmark of 200 multiple-choice questions. Across 34 locally deployed models and six deployment conditions (closed-book prompting, clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting), clean evidence raised mean accuracy from 73.5% to 94.1% while cutting high-risk errors from 12.0% to 2.6%, evidence contradictions from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Neither standard nor agentic RAG reproduced this safety profile, max-context prompting did not close the safety gap, and additional inference-time compute produced only limited gains, so safety emerges as a deployment property rather than a by-product of scale.
📝 Abstract
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
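The four headline quantities in the abstract (accuracy, high-risk error, contradiction, dangerous overconfidence) and the worst-case concentration finding can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Response` record, the field names, the 0.8 confidence threshold, the definition of dangerous overconfidence as a high-risk error given with high confidence, and the toy data. It is not the authors' scoring code, and the paper's exact metric definitions may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One model's answer to one question (hypothetical record layout)."""
    question_id: str
    model: str
    correct: bool               # chosen option is the keyed answer
    high_risk: bool             # chosen option carries a high-risk-error label
    contradicts_evidence: bool  # chosen option contradicts the supplied evidence
    confidence: float           # self-reported confidence in [0, 1] (assumed)


def safety_profile(responses, confidence_threshold=0.8):
    """Return accuracy and three safety rates over a list of responses.

    Dangerous overconfidence is ASSUMED here to mean a high-risk error
    given with confidence at or above `confidence_threshold`.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("no responses to score")

    def rate(predicate):
        return sum(1 for r in responses if predicate(r)) / n

    return {
        "accuracy": rate(lambda r: r.correct),
        "high_risk_error": rate(lambda r: r.high_risk),
        "contradiction": rate(lambda r: r.contradicts_evidence),
        "dangerous_overconfidence": rate(
            lambda r: r.high_risk and r.confidence >= confidence_threshold
        ),
    }


def high_risk_concentration(responses, top_k):
    """Share of all high-risk errors that fall on the `top_k` worst questions."""
    counts = Counter(r.question_id for r in responses if r.high_risk)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(top_k)) / total


# Toy data: two hypothetical models answering four hypothetical questions.
toy = [
    Response("q1", "m1", True, False, False, 0.90),
    Response("q1", "m2", True, False, False, 0.70),
    Response("q2", "m1", False, True, True, 0.95),
    Response("q2", "m2", False, True, False, 0.50),
    Response("q3", "m1", False, False, True, 0.60),
    Response("q3", "m2", True, False, False, 0.80),
    Response("q4", "m1", True, False, False, 0.90),
    Response("q4", "m2", False, True, True, 0.85),
]

profile = safety_profile(toy)
concentration = high_risk_concentration(toy, top_k=1)
print(profile)
print(f"share of high-risk errors on the single worst question: {concentration:.2f}")
```

Scoring each deployment condition separately with a function like `safety_profile` would show how accuracy and the safety rates can move independently, and a concentration measure like the one above is one simple way to check whether high-risk errors cluster on a few questions.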
Problem

Research questions and friction points this paper is trying to address.

clinical LLM safety
scaling laws
evidence quality
high-risk error
retrieval strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

clinical LLM safety
SaFE-Scale
evidence quality
RadSaFE-200
high-risk error
Sebastian Wind
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Tri-Thien Nguyen
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Jeta Sopa
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Mahshad Lotfinia
RWTH Aachen University
Artificial Intelligence, Deep Learning, Medical Image Analysis
Sebastian Bickelhaupt
Institute of Radiology, University Hospital Erlangen, Erlangen, Germany
Michael Uder
Institute of Radiology, University Hospital Erlangen, Erlangen, Germany
Harald Köstler
Erlangen National High Performance Computing Center, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Gerhard Wellein
Friedrich-Alexander-Universität Erlangen-Nürnberg
HPC, Performance Modelling, Performance Engineering, Sparse Solvers and Kernels
Sven Nebelung
Department of Diagnostic and Interventional Radiology, University Hospital Aachen
Advanced MRI Techniques, Functionality Assessment, Biomechanical Imaging, Cartilage, Artificial Intelligence
Daniel Truhn
Professor of Radiology, University Hospital Aachen
Machine Learning, Artificial Intelligence, Computer Vision, Medical Imaging
Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Soroosh Tayebi Arasteh
RWTH Aachen University
Deep Learning, AI in Medicine, Generative AI, Medical Image Analysis