The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making

📅 2025-06-20
🤖 AI Summary
This study investigates the robustness of medical large language models (LLMs) in clinical decision-making, specifically examining how non-semantic perturbations, such as patient gender, linguistic style, and output format, affect human-AI decision consistency. Method: We introduce MedPerturb, a novel dataset derived from real-world clinical cases, and combine controlled text perturbation along multiple axes, cross-model evaluation (four LLMs), multi-expert annotation (three clinicians per case), and causal sensitivity analysis to quantify human-AI discrepancies across 800 instances. Contribution/Results: We find that LLMs are markedly more sensitive to gender and stylistic perturbations than human clinicians, whereas humans are more susceptible to LLM-typical output formats (e.g., summaries, multi-turn dialogues), revealing a fundamental divergence in robustness mechanisms between humans and LLMs. These findings shift clinical AI evaluation from static correctness toward dynamic scenario adaptability and establish an assessment paradigm grounded in real-world clinical variability.

📝 Abstract
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
Problem

Research questions and friction points this paper is trying to address.

Evaluates medical LLM sensitivity to non-content clinical input perturbations
Compares human and LLM decision differences under clinical variability
Assesses impact of gender, style, format changes on treatment selections
Innovation

Methods, ideas, or system contributions that make the work stand out.

MedPerturb dataset evaluates clinical LLM robustness
Perturbs gender, style, format in clinical inputs
Compares human and LLM decision-making under variability
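To make the three perturbation axes concrete, here is a minimal illustrative sketch. The paper's actual perturbations are LLM-generated and carefully controlled; the function and mappings below are hypothetical, showing only a rule-based gender-swap plus the kind of prompt templates one might use for the style and format axes.

```python
import re

# Hypothetical bidirectional mapping for the gender-swap axis.
# The real MedPerturb transformations are LLM-generated, not rule-based.
GENDER_SWAP = {"she": "he", "her": "his", "woman": "man",
               "he": "she", "his": "her", "man": "woman"}

def perturb_gender(vignette: str) -> str:
    """Swap gendered terms in a clinical vignette, preserving capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = GENDER_SWAP[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(GENDER_SWAP) + r")\b"
    return re.sub(pattern, swap, vignette, flags=re.IGNORECASE)

# Illustrative prompt templates for the other two axes (assumed wording,
# not the paper's actual prompts):
STYLE_PROMPT = ("Rewrite this vignette in an uncertain, colloquial tone, "
                "keeping all clinical facts unchanged:\n{vignette}")
FORMAT_PROMPT = ("Produce a brief clinical summary of this vignette:\n"
                 "{vignette}")
```

For example, `perturb_gender("She reports chest pain. Her ECG is normal.")` yields `"He reports chest pain. His ECG is normal."` while leaving all clinical content untouched, which is the defining property of a non-content perturbation.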
Abinitha Gourabathina
Massachusetts Institute of Technology
Yuexing Hao
Research Fellow
Human-Computer Interaction · Health Intelligence
Walter Gerych
Massachusetts Institute of Technology
Marzyeh Ghassemi
Massachusetts Institute of Technology