LLM Sensitivity Evaluation Framework for Clinical Diagnosis

📅 2025-04-18
🏛️ International Conference on Computational Linguistics
🤖 AI Summary
This work addresses the insufficient sensitivity of large language models (LLMs) to critical clinical information—such as symptom duration and vital sign trends—in diagnostic reasoning, systematically characterizing their diagnostic fragility for the first time. We propose a clinical diagnosis–oriented sensitivity evaluation framework, incorporating multi-dimensional, medical-semantic-preserving perturbation strategies: adversarial perturbations, entity substitutions, and temporal deformations. To quantify robustness, we introduce a diagnostic pathway consistency metric and validate results via expert annotation. Experiments across GPT-3.5, GPT-4, Gemini, Claude-3, and LLaMA2-7b reveal an average 27.6% drop in diagnostic accuracy under minor perturbations to key clinical cues, exposing significant vulnerabilities in clinical reasoning. We publicly release the DiagnosisQA benchmark dataset and evaluation code, establishing a new standard and open toolkit for advancing the reliability and clinical trustworthiness of LLMs in medicine.
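The summary above describes perturbing key clinical cues (e.g., substituting medically equivalent entities) and measuring how often the model's diagnosis changes. A minimal sketch of that workflow is shown below; all function names and data are hypothetical illustrations, not the paper's released evaluation code — see the authors' DiagnosisQA repository for the actual implementation.

```python
# Hypothetical sketch of a medical-semantic-preserving perturbation check.
# Function names and the toy data are illustrative assumptions, not the
# paper's actual API.

def entity_substitution(question: str, synonyms: dict[str, str]) -> str:
    """Replace clinical entities with medically equivalent synonyms,
    preserving the question's diagnostic meaning."""
    for term, synonym in synonyms.items():
        question = question.replace(term, synonym)
    return question

def consistency_rate(original: list[str], perturbed: list[str]) -> float:
    """Fraction of cases whose predicted diagnosis is unchanged after
    perturbation (a simple stand-in for a pathway-consistency metric)."""
    matches = sum(a == b for a, b in zip(original, perturbed))
    return matches / len(original)

# Toy example: two cases, one diagnosis flips under perturbation.
original_dx  = ["myocardial infarction", "pneumonia"]
perturbed_dx = ["myocardial infarction", "bronchitis"]
print(consistency_rate(original_dx, perturbed_dx))  # 0.5
```

A drop in this consistency rate under semantics-preserving edits is the kind of fragility the reported 27.6% average accuracy decline quantifies.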


📝 Abstract
Large language models (LLMs) have demonstrated impressive performance across various domains. However, clinical diagnosis places higher expectations on LLMs' reliability and sensitivity: they must think like physicians and remain sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, i.e., GPT-3.5, GPT-4, Gemini, Claude-3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their sensitivity to key information, and effectively utilizing this information. These improvements will strengthen human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM sensitivity to key medical information for diagnosis
Assessing LLM reliability in clinical diagnostic decision-making
Improving LLM ability to detect and utilize critical medical data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLM sensitivity to key medical information
Introduces perturbation strategies for testing
Highlights need for improved reliability and sensitivity
Chenwei Yan
Beijing University of Posts and Telecommunications
Natural Language Processing; Large Language Models
Xiangling Fu
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education
Yuxuan Xiong
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education
Tianyi Wang
School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education
Siu Cheung Hui
Associate Professor, School of Computer Engineering, Nanyang Technological University, Singapore
data mining; text mining; natural language processing; semantic search; question-answering communities
Ji Wu
Tsinghua University
Artificial Intelligence; smart healthcare; machine learning; pattern recognition; speech recognition
Xien Liu
Tsinghua University
Deep Learning; Medical NLP; Large Language Models