🤖 AI Summary
This study investigates the robustness of medical large language models (LLMs) in clinical decision-making, specifically how non-semantic perturbations of the input (patient gender cues, linguistic style, and presentation format) affect the consistency between human and LLM treatment decisions. Method: We introduce MedPerturb, a dataset derived from real-world clinical cases, combining controlled text perturbation along three axes with cross-model evaluation (four LLMs), multi-expert annotation (three clinician reads per case), and sensitivity analysis to quantify human-AI discrepancies across 800 clinical contexts. Contribution/Results: LLMs are more sensitive to gender and style perturbations, whereas human clinicians are more susceptible to LLM-generated output formats (e.g., summaries, multi-turn dialogues), revealing a divergence in how humans and LLMs respond to input variability. These findings argue for moving clinical AI evaluation beyond static correctness benchmarks toward assessments grounded in the input variability characteristic of real clinical settings.
📝 Abstract
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, together with outputs from four LLMs and three human expert reads per clinical context. We use MedPerturb in two case studies to show how shifts in gender identity cues, language style, or format lead to diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations, while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks and assess the agreement between human clinician and LLM decisions under the variability characteristic of clinical settings.
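One way to make the sensitivity comparison above concrete is a per-axis decision flip rate: the fraction of cases in which a reader's (human or LLM) treatment selection changes between the original and perturbed versions of the same vignette. The sketch below is a hypothetical illustration of that metric only; the `Instance` schema, field names, and perturbation labels are assumptions for demonstration, not MedPerturb's actual format or the paper's analysis code.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical record layout for one perturbed read; field names are
# illustrative assumptions, not the dataset's published schema.
@dataclass
class Instance:
    case_id: str              # identifies the underlying clinical vignette
    axis: str                 # "gender", "style", or "format"
    perturbation: str         # e.g. "gender_swap", "uncertain_phrasing", "summary"
    decision_original: str    # treatment chosen on the unperturbed vignette
    decision_perturbed: str   # treatment chosen on the perturbed vignette

def flip_rate_by_axis(instances):
    """Fraction of reads per perturbation axis where the decision changes."""
    flips, totals = defaultdict(int), defaultdict(int)
    for inst in instances:
        totals[inst.axis] += 1
        flips[inst.axis] += inst.decision_original != inst.decision_perturbed
    return {axis: flips[axis] / totals[axis] for axis in totals}

# Toy example: a reader that flips under a gender swap but not a style change
# would show higher gender sensitivity than style sensitivity.
reads = [
    Instance("c1", "gender", "gender_swap", "manage_at_home", "refer_to_ed"),
    Instance("c1", "style", "colloquial_tone", "manage_at_home", "manage_at_home"),
]
print(flip_rate_by_axis(reads))  # {'gender': 1.0, 'style': 0.0}
```

Computing this rate separately for the LLM outputs and the human expert reads would surface exactly the asymmetry the paper reports: higher flip rates for LLMs on the gender and style axes, and higher flip rates for humans on the format axis.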