Clinical knowledge in LLMs does not translate to human interactions

📅 2025-04-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) achieve high accuracy on standardized medical knowledge assessments (e.g., GPT-4o identifies the correct condition in 94.9% of test scenarios), yet their real-world effectiveness in public health consultation remains unverified and potentially overstated. Method: We conducted a randomized controlled trial with 1,298 participants comparing real-time LLM assistance (GPT-4o, Llama 3, Command R+) against autonomous information retrieval for diagnosing common conditions and selecting appropriate management actions. Contribution/Results: With LLM assistance, participants identified relevant conditions in under 34.5% of cases and chose correct management actions in under 44.2%, statistically indistinguishable from the control group and far below the models' standalone performance. This is the first empirical demonstration that high performance on static medical benchmarks does not predict effective human–LLM interaction in authentic clinical decision support. The findings expose a critical clinical validity gap in current evaluation paradigms. We advocate mandating human-in-the-loop randomized controlled trials as a prerequisite for deploying LLMs in healthcare applications, shifting assessment from isolated model capability to interactive real-world efficacy.

📝 Abstract
Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.
Problem

Research questions and friction points this paper is trying to address.

LLMs fail to improve public accuracy in medical condition identification
Human interaction challenges limit LLM effectiveness in healthcare advice
Standard benchmarks do not predict real-world LLM medical performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tested LLMs in controlled medical scenarios
Compared LLM performance with human interactions
Recommended systematic human testing pre-deployment
Andrew M. Bean
Oxford Internet Institute, University of Oxford, Oxford, UK.
Rebecca Payne
Nuffield Department of Primary Care, University of Oxford, Oxford, UK; North Wales Medical School, Bangor University, Bangor, UK.
Guy Parsons
Oxford Internet Institute, University of Oxford, Oxford, UK.
Hannah Rose Kirk
University of Oxford
Large language models · NLP · Ethics in AI · Alignment · AI Safety
Juan Ciro
Contextual AI, Mountain View, USA.
Rafael Mosquera
MLCommons, San Francisco, USA; Factored AI, Palo Alto, USA.
Sara Hincapié Monsalve
MLCommons, San Francisco, USA; Factored AI, Palo Alto, USA.
Aruna S. Ekanayaka
Birmingham Women’s and Children’s NHS Foundation Trust, Birmingham, UK.
Lionel Tarassenko
Institute of Biomedical Engineering, University of Oxford, Oxford, UK.
Luc Rocher
Associate Professor, University of Oxford
Privacy · Algorithm Auditing · Algorithmic Fairness · Machine Learning
Adam Mahdi
Associate Professor, University of Oxford
large language models · multimodal AI · digital health