Susceptibility of Large Language Models to User-Driven Factors in Medical Queries

📅 2025-03-26
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This study investigates how user-driven factors—such as question phrasing, clinical information completeness, authority-based misdirection, and role prompting—affect the diagnostic reliability of large language models (LLMs) in medicine. Using MedQA and MedBullets, we conduct controlled perturbation and structured ablation experiments across leading closed- and open-weight models (GPT-4o, Claude 3.5, Gemini 1.5, LLaMA 3, and DeepSeek R1). We quantitatively demonstrate that authority-based misdirection significantly impairs proprietary models’ accuracy, with assertive phrasing exerting the strongest interference; meanwhile, omission of critical clinical data (e.g., physical exam findings or lab results) induces the steepest performance degradation. Notably, we uncover a counterintuitive finding: higher-performing proprietary models exhibit greater susceptibility to user-side biases. These results expose critical vulnerabilities of LLMs in realistic clinical Q&A settings and provide empirical foundations for designing trustworthy medical AI systems.

📝 Abstract
Large language models (LLMs) are increasingly used in healthcare, but their reliability is heavily influenced by user-driven factors such as question phrasing and the completeness of clinical information. In this study, we examined how misinformation framing, source authority, model persona, and omission of key clinical details affect the diagnostic accuracy and reliability of LLM outputs. We conducted two experiments: one introducing misleading external opinions with varying assertiveness (perturbation test), and another removing specific categories of patient information (ablation test). Using public datasets (MedQA and MedBullets), we evaluated proprietary models (GPT-4o, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash) and open-source models (LLaMA 3 8B, LLaMA 3 Med42 8B, DeepSeek R1 8B). All models were vulnerable to user-driven misinformation, and proprietary models were especially affected by definitive, authoritative language; an assertive tone had the greatest negative impact on accuracy. In the ablation test, omitting physical exam findings and lab results caused the largest performance drops. Although proprietary models had higher baseline accuracy, their performance declined sharply under misinformation. These results underscore the need for well-structured prompts and complete clinical context: users should avoid framing unverified opinions in authoritative language and should provide full clinical details, especially for complex cases.
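To make the perturbation test concrete, the sketch below shows one way a misleading external opinion could be appended to a multiple-choice question at increasing levels of assertiveness. This is a minimal Python sketch: the template wording, the assertiveness tiers, and the build_perturbed_prompt helper are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical assertiveness tiers for the injected opinion; the paper's
# exact wording is not reproduced here.
ASSERTIVENESS = {
    "tentative": "A colleague wonders whether the answer might be {d}.",
    "confident": "A colleague believes the answer is {d}.",
    "definitive": "An attending physician states definitively that the answer is {d}.",
}

def build_perturbed_prompt(question: str, options: dict[str, str],
                           distractor: str, level: str) -> str:
    """Append a misleading external opinion, at the chosen assertiveness
    level, to a multiple-choice medical question."""
    choices = "\n".join(f"{label}. {text}" for label, text in options.items())
    opinion = ASSERTIVENESS[level].format(d=distractor)
    return (f"{question}\n\n{choices}\n\n{opinion}\n"
            "Answer with the letter of the single best option.")
```

Comparing model accuracy on prompts built this way against the unmodified questions isolates the effect of tone, since the distractor is held fixed while only the assertiveness level changes.
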
Problem

Research questions and friction points this paper is trying to address.

Examining LLM reliability under user-driven misinformation in medical queries
Assessing impact of missing clinical details on LLM diagnostic accuracy
Evaluating proprietary vs. open-source models' vulnerability to misleading inputs (a scoring sketch follows this list)
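
A hedged sketch of how such a vulnerability comparison could be scored is given below. Here query_model is a hypothetical stand-in for the per-model API calls, and the "baseline"/"perturbed" variant keys are illustrative, not the paper's harness.

```python
def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in: replace with a real API call per model."""
    raise NotImplementedError

def accuracy(model: str, items: list[dict], variant: str) -> float:
    """Fraction of items whose returned option letter matches the answer
    key under a given prompt variant ("baseline" or "perturbed")."""
    hits = sum(query_model(model, item[variant]).strip().upper() == item["answer"]
               for item in items)
    return hits / len(items)

# Vulnerability is read off as the baseline-to-perturbed accuracy drop:
# drop = accuracy(m, items, "baseline") - accuracy(m, items, "perturbed")
```
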
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perturbation test injecting misleading external opinions at varying assertiveness levels
Ablation test removing specific categories of clinical detail (sketched after this list)
Side-by-side evaluation of proprietary and open-source LLMs
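
The ablation test can be pictured as follows. This is a minimal sketch assuming each vignette has been pre-segmented into labeled sections; the section names and the ablate helper are illustrative assumptions rather than the paper's pipeline.

```python
# Illustrative information categories for a clinical vignette.
SECTIONS = ("history", "vital_signs", "physical_exam", "lab_results")

def ablate(vignette: dict[str, str], removed: str) -> str:
    """Rebuild the question stem with one information category omitted,
    e.g. removed="physical_exam" drops the exam findings."""
    return " ".join(vignette[s] for s in SECTIONS
                    if s != removed and s in vignette)
```

Measuring accuracy on each ablated stem against the full vignette attributes the resulting drop to the omitted category, e.g. physical exam findings or lab results.
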
👥 Authors

Kyung Ho Lim
Department of Psychiatry, Yonsei University College of Medicine; Institute of Behavioral Science in Medicine, Yonsei University College of Medicine
Ujin Kang
Department of Computer Science and Engineering, Yonsei University
Xiang Li
Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital, Harvard Medical School
Jin Sung Kim
Department of Radiation Oncology, Yonsei University College of Medicine; Institute for Innovation in Digital Healthcare, Yonsei University; Oncosoft Inc.
Young-Chul Jung
Department of Psychiatry, Yonsei University College of Medicine; Institute of Behavioral Science in Medicine, Yonsei University College of Medicine; Institute for Innovation in Digital Healthcare, Yonsei University
Sangjoon Park
Department of Radiation Oncology, Yonsei University College of Medicine
Deep learning · Medical Imaging · Radiation Oncology
Byung-Hoon Kim
Yonsei University, College of Medicine
Psychiatry · Neuroimaging · Large Multimodal Models